|
|
|
|
Current Projects
-
Dynamic Virtual Machine Project
Dynamic Virtual Machine (DVM) system is a prototype middleware providing applications a secure, stable and specialized computing environment in cyberspace. It encapsulates computational resources in a secure, isolated and customized virtual machine environment, and enables transparent service mobility and service provisioning. In this research, a computing environment is modeled as DVM, an abstract virtual machine, and is incarnated automatically on various virtualization platforms. To migrate a virtual machine, the DVM system needs to collect the runtime states of a VM. The communication should be kept alive and messages in transit should be delivered in order. DVM Project is a continuation and further development of our previous projects. We use our previous work in High Performance Computing Mobility (HPCM) middleware to migrate the running state of the VM. HPCM is originally designed to support runtime states collection and transfer, and migrate a running process in heterogeneous environments. It has been implemented and tested with various applications. We develop a set of communication protocols, which prevent message loss and keep them in order for both point-to-point communication and group communication. The point-to-point communication protocols are implemented in the Scalable Network of Workstations (SNOW) project and tested on PVM platform; the group communication protocols are implemented in MPI-mitten (Enabling Migration Technology in MPI), a user level communication library and tested on MPI-2 platform.
-
Grid Harvest Service (GHS):
Rapid advancement of communication technology has changed the landscape of computing. New models of computing, such as business-on-demand, Web services, peer-to-peer networks, and Grid computing have emerged to harness distributed computing and network resources to provide powerful services. In these new computing platforms resources are shared, and likely are remote and out of the user's control. Consequently, resources availability to each user varies largely from time to time, due to resource sharing, system configuration change, potential software or hardware failure, and other factors beyond the control of a user. This non-deterministic nature of the resource availability raises an outstanding challenge to these newly-emerged models of computing: how to support Quality of Service (QoS) to meet a user's demand? To address this challenge, we propose and develop Grid Harvest Service (GHS), a software solution based on advanced performance modeling, resources management and monitoring, and task allocation and scheduling techniques for system support of QoS in shared network environments.
We develop and build different models to evaluate the impact of various resource availabilities and parallel processing on application performance, such as the effect of local jobs on the remote application's performance, the effect of resource reservation on local jobs' performance, and the impact of system failure on the application's execution time. To support the application of these novel performance models in practice, we implement effective and adaptive performance measurement mechanisms to dynamically monitor resource and application QoS. Based on resource sharing policies and application perspectives of QoS, we design and develop different task allocation strategies and scheduling algorithms to optimize the application completion time or application failure probability. This prototype software system is designed to provide system support of QoS in the most challenging distributed computing environment, Grid computing. GHS is incorporated into current Grid middleware to enable automated task submission, task execution, and data transfer across heterogeneous administration domains. GHS provides end-users with QoS guaranteed service automatically based on the underlying QoS policies. The source code of GHS 1.1 is available online at GHS Web site. GHS 1.1 only considers performance as the measure of QoS. In our current research, we are extending QoS to include reliability.
- FENCE: Fault awareness ENabled Computing Environment
Modern high-end computing (HEC) systems are powerful but unprecedented complex. Unfortunately, the more complex a system is, the less reliable it is. With the design of HEC systems with thousands of processors, fault tolerance has become a timely important issue. The conventional solution, checkpointing, achieves fault tolerant by periodically saving system snapshots to the disk or memory. With the growing gap between processor speed and data access speed, such a reactive approach can further increase the disparity between sustained performance and peak performance in HEC. Recently, the proactive fault tolerance approach has been proposed for HEC. However, the newly emerged proactive approach relies on accurate prediction of failure, which is hardly achievable in practice.
In this project, we introduce a fault awareness enabled computing environment (FENCE) approach for high-end computing. It combines the merits of both the newly emerged proactive fault tolerant and the traditional checkpointing approach.
-
Server-Push Data Access Architecture
Many High-End Computing (HEC) applications are data intensive and their performance suffer from high data access latencies. Data access latency contributes to a major portion of gap between peak performance and sustained system performance of HEC machines. Memory access latency is high and getting a lot of attention from researchers to bridge the gap between processor and memory performance. However, I/O latency is even higher than memory access latency and has been continuously ignored by many researchers. The main reason behind poor I/O performance has again been the gap between the improvements of processor performance and storage performance. Although advanced parallel file systems have been developed in recent years, they provide high bandwidth only for large, well-formed data streams, and perform poorly for accessing small, noncontiguous data. Unfortunately, many HEC applications make a large number of requests for small and noncontiguous pieces of data. While techniques such as collective I/O can be used to merge small I/O requests into large ones, many small I/O requests cannot be eliminated due to the inherent nature of the underlying applications. The I/O wall remains after years of study and is becoming the most important issue in HEC.
With our experience in optimizing data access performance, we propose a new I/O architecture for HEC. Unlike traditional I/O designs where data is stored and retrieved by request, our architecture is based on a novel “Server-Push” model where a data access server, called File Access Server (FAS), proactively “pushes” data in time from a file server to the compute node's memory. Here, “push” means the data is sent before an I/O request is generated by the client; by “in time”, we mean that data is moved from its source to destination within a window of time before it is required, and where it does not replace other data blocks from I/O cache falsely. The objective of this research is two fold: 1) increasing fundamental understanding of data access delay, 2) producing an effective I/O architecture that minimizes I/O latency. We plan to increase the fundamental understanding through the study of data access pattern identification, prefetching algorithms, data replacement strategy, and extensive experimental testing. We will verify the performance improvement with our file server design for various critical I/O intensive applications by using a combination of simulation and actual implementation in the PVFS2 file system. The goal of implementing FAS is a significant reduction in time-to-solution of various I/O intensive scientific and numerical applications, which improves the productivity of High-End Computing.
- Workflow Research Project
Workflow management is a new area of distributed computing. It shares many common characteristics with business workflow. However, with the management of thousands processes running coordinatedly in a widely distributed, shared network environment, the workflow of distributed computing is much more complex than the conventional business workflow. Workflow supports task scheduling but is more than task scheduling. From the view point of computing service, any user request is a service, which can be decomposed into a series of known basic services. These basic services may have inherent control and data dependence and need to be sent to different servers for service. These servers, in turn, may be shared by many and may not be in the user's administration domain. Workflow management is designed to get the computing service done effectively. The Lattice QCD workflow research project is in the realm of scientific workflows. LQCD computations are very CPU demanding and produce large intermediate data files (in the order of gigabytes). Dedicated clusters are exclusively used for solving LQCD problems. Our goal is to develop a run time
environment for managing the LQCD workflows (also called campaigns) that maximizes usage of these computing environments.
- Pervasive Computing:
The development of human-centered computing environment that combines processors and sensors with network technologies and intelligent software.
|
Past Projects
- DistDLB: Dynamic Load Balancing of Scientific Applications on Parallel and
Distributed Systems
Large-scale simulation is an important method in scientific and engineering research, as a compliment to theory and experiment, and has been widely used to study complex phenomena in many disciplines. Many large-scale applications are adaptive in that their computational load varies throughout the execution and causes uneven distribution of the workload at run-time. Dynamic load balancing (DLB) of adaptive applications involves in efficiently partitioning of the application and then migrating of excess workload from overloaded processors to underloaded processors during execution.
Different applications have different adaptive characteristics, which may result in the existing DLB schemes/tools not well suited for them. We have been working with scientists from various disciplines with the objective to provide efficient and effective load balancing techniques for their applications. The applications that we are working with
include the cosmology application ENZO and the molecular dynamics application GROMACS.
Most existing partition schemes are targeted for homogeneous parallel systems, which are not appropriate for large-scale applications running on heterogeneous distributed systems. Furthermore, the cost entailed by workload migration may consume orders of magnitude more time than the actual partitioning when the excess workload is transferred across geographically distributed machines. In particular, with workload migration, it is critical to take into account that the wide area network (used to connect the geographically distributed sites) performance is dynamic, changing throughout execution, in addition to the resource heterogeneity. To address these problems, we have designed novel data partition and migration techniques that can be applicable to a range of large-scale adaptive applications executed on heterogeneous distributed computing environments.
- Highly Accurate PArallel Numerical Simulations (HAPANS): a systematic approach to develop highly accurate parallel numerical simulations for CFD applications.
- High Performance Computing Mobility (HPCM) middleware :
a middleware which supports mobility of legacy code.
- SNOW: A Distributed Dynamic System for Metacomputing
- DOT:
Distributed Optical Testbed to Facilitate the Development of Techniques for Efficient Execution of Distributed Applications.
- Virtual Collaboratory for Numeical Simulatin (VCNS)
is a suite of software systems, communication protocals, and tools that enable computer-based cooperative work. It is a sister project of SNOW.
|
|
|