Utilizing Memory Parallelism for High Performance Data Processing While advances in microprocessor design continue to increase computing speed, improvements in the data access speed of computing systems lag far behind. At the same time, data-intensive large-scale applications, such as information retrieval, computer animation, and big data analytics, are emerging. Data access delay has become a critical performance bottleneck of modern high performance computing (HPC). Memory concurrency exists at each layer of modern memory hierarchies; however, conventional computing systems are primarily designed to improve CPU utilization and have inherent limitations in addressing the critical issue of data movement in HPC. To address the data movement bottleneck, this project extracts the general principles of parallel memory systems by building on the new Concurrent-AMAT (C-AMAT) metric, new theoretical results in data access concurrency, a series of recent successes in parallel I/O optimization, and new technology opportunities.
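As a rough illustration of why a concurrency-aware metric matters, the following sketch (with illustrative, assumed numbers) contrasts the classic AMAT model, which prices each access in isolation, with C-AMAT, which credits the overlap of concurrent hits and misses:

```python
def amat(hit_time, miss_rate, avg_miss_penalty):
    """Classic Average Memory Access Time: cost of one access in isolation."""
    return hit_time + miss_rate * avg_miss_penalty

def c_amat(hit_time, pure_miss_rate, pure_miss_penalty,
           hit_concurrency, miss_concurrency):
    """Concurrent-AMAT: hits overlap with concurrency C_H, and pure misses
    (misses whose latency is not hidden by concurrent hits) overlap with
    concurrency C_M, so the per-access cost shrinks with concurrency."""
    return (hit_time / hit_concurrency
            + pure_miss_rate * pure_miss_penalty / miss_concurrency)

# Illustrative numbers (assumed, not measured): 1-cycle hits, 5% pure misses
# costing 100 cycles, 4-way hit concurrency, 2-way pure-miss concurrency.
print(amat(1, 0.05, 100))          # about 6 cycles per access, no overlap
print(c_amat(1, 0.05, 100, 4, 2))  # about 2.75 cycles per access with overlap
```

The same cache, viewed with concurrency in mind, is more than twice as fast per access, which is exactly the headroom a parallel memory system design tries to exploit.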
Empower Data-Intensive Computing: the integrated data management approach From the system point of view, there are two types of data: observational data, collected by electronic devices such as sensors, monitors, and cameras; and simulation data, generated by computing. In general, the latter is used in traditional scientific high-performance computing (HPC) and requires strong consistency for correctness. The former is popular in newly emerged big data applications and does not require strong consistency. The difference in consistency requirements leads to two kinds of file systems: data-intensive distributed file systems, represented by the MapReduce-oriented Hadoop Distributed File System (HDFS), modeled on Google's file system and developed at Yahoo; and computing-intensive file systems, represented by high-performance parallel file systems (PFS), such as the IBM General Parallel File System (GPFS). These two kinds of file systems are designed with different philosophies and for different applications, and they do not talk to each other. They form two separate ecosystems and are used by different communities. However, as data-intensive applications become increasingly ubiquitous, understanding the huge amounts of collected data requires powerful computation; meanwhile, in HPC, large-scale computations also demand the ability to handle huge amounts of data, not only because HPC generates more data than before but also because advanced data-oriented technologies such as visualization and data mining have become part of general scientific computing. Therefore, HPC and big data requirements are merging, and we need an integrated solution for data-intensive HPC and high-performance data analytics (HPDA). In this research, we propose an Integrated Data Access System (IDAS) to bridge the data management gap.
Empowering Data Management, Diagnosis, and Visualization of Cloud-Resolving Models by Cloud Library upon Spark and Hadoop In the age of big data, scientific applications are generating large volumes of data, leading to an explosion of requirements and complexity in processing these data. In high performance computing (HPC), data management is traditionally supported by parallel file systems (PFS) such as Lustre, PVFS2, and GPFS. In big data environments, general-purpose analysis frameworks like MapReduce and Spark are popular and highly available, with data storage supported by distributed file systems such as the Hadoop Distributed File System (HDFS). As HPC becomes more and more data intensive, Hadoop and Spark are increasingly adopted in HPC environments as well. NASA, one of the leading institutions in cloud-resolving research, has adopted Hadoop and Spark to analyze the data generated by its large-scale simulations of cloud-resolving models. However, Hadoop and Spark are not designed for HPC machines: they do not take advantage of the expensive and sophisticated technologies present in existing supercomputers, and they cannot read data from PFS directly. The data to be processed cannot enjoy the merits of both PFS and HDFS, and must be copied sequentially between the two file systems. Scientists who want to take advantage of the big data analytics available in Hadoop and Spark must copy data from parallel file systems manually, which is a painfully slow process, especially for datasets of terabytes. In this research, we propose a framework to enable data management, diagnosis, and visualization of cloud-resolving models in Hadoop and Spark. Our project is part of the NASA Super Cloud project; please find more details about the Super Cloud project here.
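The manual staging described above is essentially a serial, file-by-file copy between the two storage systems. A minimal Python sketch of that bottleneck, with local directories standing in for PFS and HDFS (an assumption for illustration; a real deployment would read from the PFS mount and write through the `hdfs` client):

```python
import shutil
import tempfile
from pathlib import Path

def stage_to_hdfs(pfs_dir, hdfs_dir):
    """Serially copy every simulation output file from a PFS directory into
    an HDFS-backed directory. One file at a time, no parallelism: this is
    the copy step the proposed framework is designed to remove."""
    hdfs_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for f in sorted(pfs_dir.iterdir()):
        copied.append(shutil.copy2(f, hdfs_dir / f.name))
    return copied

# Tiny demo: three hypothetical "cloud model outputs" staged one by one.
root = Path(tempfile.mkdtemp())
(root / "pfs").mkdir()
for i in range(3):
    (root / "pfs" / f"cloud_{i}.nc").write_bytes(b"\0" * 1024)
staged = stage_to_hdfs(root / "pfs", root / "hdfs")
print(len(staged))  # 3 files copied sequentially
```

At terabyte scale, this single serial loop dominates end-to-end analysis time, which is the motivation for letting Spark read PFS-resident data in place.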
DEP: A Decoupled Execution Paradigm for Data-intensive High-End Computing Large-scale applications in critical areas of science and technology have become more and more data intensive, and I/O has become a vital performance bottleneck of modern high-end computing (HEC) practices. Conventional HEC execution paradigms, however, are computing-centric: they are designed to maximize CPU performance for computation-intensive applications, and have inherent limitations in addressing the newly emerged data access and data management issues of HEC. In this project, we propose an innovative decoupled execution paradigm (DEP) built on the notion of separating computing-intensive and data-intensive operations. The novelty of DEP is to use two different sets of hardware and software to handle computing-intensive and data-intensive operations separately. In addition to identifying and optimizing these two kinds of operations, an underlying challenge of DEP is to develop a twin-coupled runtime system that integrates the two sets of hardware and software seamlessly and effectively, so that users do not need to be aware of the move from one operation mode to the other, and all optimizations are conducted automatically.
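The core idea of DEP can be sketched in a few lines: a single runtime routes each operation to one of two resource pools, so the caller never sees the mode switch. The class and routing rule below are hypothetical illustrations, with thread pools standing in for the two sets of hardware:

```python
from concurrent.futures import ThreadPoolExecutor

class DecoupledRuntime:
    """Illustrative sketch of the DEP idea (not the project's actual
    runtime): one pool stands in for compute nodes, the other for data
    nodes, behind a single submission interface."""
    def __init__(self):
        self.compute_pool = ThreadPoolExecutor(max_workers=4)  # compute-node stand-in
        self.data_pool = ThreadPoolExecutor(max_workers=2)     # data-node stand-in

    def submit(self, fn, *args, data_intensive=False):
        # The runtime, not the user, routes the operation to the right pool.
        pool = self.data_pool if data_intensive else self.compute_pool
        return pool.submit(fn, *args)

rt = DecoupledRuntime()
c = rt.submit(sum, range(1000))                                # computing-intensive
d = rt.submit(lambda: len(b"x" * 4096), data_intensive=True)   # data-intensive stand-in
print(c.result(), d.result())  # 499500 4096
```

The hard part the paragraph above points at, classifying operations and migrating between modes automatically, is exactly what this toy routing flag glosses over.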
This project is collaborative research with UTT, UIUC, and ANL.
Application-Specific Optimization via Server Push I/O Architecture
As modern multicore architectures put ever more pressure on sluggish memory systems, computer applications become more and more data intensive. Advanced memory hierarchies and parallel file systems have been developed in recent years; however, they only provide good performance for well-formed data streams and fail to meet more general demands. I/O has become a crucial performance bottleneck of high-end computing (HEC), especially for data-intensive applications. New mechanisms and new I/O architectures need to be developed to solve the 'I/O-wall' problem. We propose a new I/O architecture for HEC. Unlike traditional I/O designs, where data is stored and retrieved upon request, our architecture is based on an application-specific 'Server-Push' approach in which a data access server automatically tailors the I/O system to optimize data retrieval and data layout for the best performance. Here, push means that the data access server makes decisions for in-time data retrieval and smart data layout, where in time is defined as moving data from its source to its destination within a window of time before it is required, and smart means laying out data based on application I/O patterns, thereby fully utilizing parallel I/O and parallel file systems. The basic idea of 'Server-Push' is trading computing power for accurate I/O prefetching and for optimized data layout and organization. Designing the 'Server-Push' system requires rethinking the design at each level of the memory hierarchy, from the compiler to the file system to the runtime system, as well as rigorous performance modeling and implementation verification.
The objective of this research is threefold: 1) increasing the fundamental understanding of I/O behavior at both the computing client level and the file system server level; 2) providing application-specific performance optimizations using caching, prefetching, and layout optimizations based on that understanding; 3) producing an effective I/O architecture that minimizes I/O access delay. We plan to increase the fundamental understanding of I/O behavior through the study of I/O access pattern identification, prefetching algorithms, data replacement strategies, data distribution strategies, and extensive experimental testing. We verify the performance improvement of the server-push design for various critical I/O-intensive applications using a combination of simulation and an actual implementation in the PVFS2 file system. This is a joint project with Argonne National Laboratory and the University of Illinois at Urbana-Champaign.
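A minimal sketch of the 'Server-Push' idea, assuming a simple constant-stride access pattern (the project's actual pattern identification is far more general): the server watches the access stream and, once a stride is confirmed, pushes the next blocks before the client asks.

```python
class ServerPushPrefetcher:
    """Illustrative sketch, not the actual implementation: detect a
    constant stride in the observed block accesses and push the next
    `depth` blocks ahead of the request ("in time" delivery)."""
    def __init__(self, depth=2):
        self.history = []
        self.depth = depth  # how far ahead to push

    def on_access(self, block):
        self.history.append(block)
        if len(self.history) >= 3:
            a, b, c = self.history[-3:]
            if b - a == c - b and c - b != 0:     # constant stride confirmed
                stride = c - b
                return [c + stride * i for i in range(1, self.depth + 1)]
        return []  # no confident pattern yet: push nothing

p = ServerPushPrefetcher()
pushed = []
for blk in [10, 12, 14, 16]:   # client reads every other block
    pushed = p.on_access(blk)
print(pushed)  # [18, 20]
```

The "trading computing power" point shows up here: the server spends cycles on pattern analysis so the client never waits on a demand miss for blocks 18 and 20.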
FENCE: Fault awareness ENabled Computing Environment Modern high-end computing (HEC) systems are powerful but unprecedentedly complex. Unfortunately, the more complex a system is, the less reliable it is. With HEC systems now designed with thousands of processors, fault tolerance has become a pressing issue. The conventional solution, checkpointing, achieves fault tolerance by periodically saving system snapshots to disk or memory. With the growing gap between processor speed and data access speed, such a reactive approach can further increase the disparity between sustained performance and peak performance in HEC. Recently, proactive fault tolerance approaches have been proposed for HEC. However, the newly emerged proactive approach relies on accurate failure prediction, which is hardly achievable in practice. In this project, we introduce the fault awareness enabled computing environment (FENCE) approach for high-end computing. It combines the merits of the newly emerged proactive fault tolerance approach and the traditional checkpointing approach.
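One way to see how prediction and checkpointing can be combined is to let predicted risk modulate the checkpoint interval. The sketch below uses Young's classic interval formula; the blend rule is an assumption for illustration, not FENCE's actual algorithm:

```python
import math

def young_interval(ckpt_cost, mtbf):
    """Young's first-order optimal checkpoint interval (a standard model):
    sqrt(2 * checkpoint_cost * mean_time_between_failures), in seconds."""
    return math.sqrt(2 * ckpt_cost * mtbf)

def fault_aware_interval(ckpt_cost, mtbf, predicted_risk):
    """Hypothetical fault-aware adjustment in the spirit of FENCE: when a
    predictor reports elevated failure risk (0..1), shrink the effective
    MTBF, so checkpoints come more often exactly when failure is likely."""
    effective_mtbf = mtbf * (1.0 - predicted_risk)
    return math.sqrt(2 * ckpt_cost * max(effective_mtbf, 1e-9))

base = young_interval(60, 24 * 3600)              # 60 s checkpoints, 24 h MTBF
risky = fault_aware_interval(60, 24 * 3600, 0.75)  # predictor reports 75% risk
print(round(base), round(risky))  # 3220 1610
```

Because the interval degrades gracefully as the prediction weakens (risk near 0 recovers plain checkpointing), mispredictions cost extra checkpoints rather than lost work, which is the attraction of the combined approach.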
Multicore Scheduling and Data Access The multicore microprocessor has totally changed the landscape of (single machine) computing: it brings parallel processing into a single processor at the task level. On one hand, it significantly enlarges the performance gap between data processing and data access. On the other hand, it calls for a rethinking of system design to utilize the potential of multicore architectures. We believe the key to utilizing multicore microprocessors is to reduce data access delay, so we rethink system scheduling and support from the viewpoint of data access under multicore environments. To this end, this research focuses on the design and development of multicore-specific system supports and mechanisms to reduce data access delay. Our approach is threefold: special hardware for swift data access; core-aware and context-aware scheduling and prefetching; and integrated I/O handling for data-intensive applications. In this project, we propose to enhance the memory access performance of modern microarchitectures with specialized data prefetching and access scheduling techniques. Data prefetching, which decouples and overlaps data transfer and computation, is widely considered an effective memory latency hiding technique. Cache misses are a common cause of CPU stalls, and using cache memories effectively helps bridge the performance gap between the processor and memory. To achieve this goal, data prefetching predicts the future data accesses of a processor, initiates data fetches early, and brings the data closer to the processor before the processor requests it. Access scheduling is an effective solution for tackling the memory performance bottleneck as well: memory access scheduling can reorganize data accesses to reduce the average memory access latency and improve memory bandwidth utilization.
We have introduced the data access history cache architecture to support dynamic hardware prefetching and smart data management, and have recently developed core-aware memory access scheduling for multicore systems. More exciting results are in the pipeline.
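As a small illustration of what memory access scheduling buys, the sketch below reorders pending requests so that accesses hitting the currently open DRAM row are served before those requiring a row switch. A real core-aware scheduler would also weigh which core issued each request and enforce fairness; those are omitted here, so this is a simplified stand-in rather than the developed scheduler:

```python
def schedule_row_hits_first(requests, open_row):
    """Serve requests that hit the currently open DRAM row first, then
    those that need a row activation; arrival order is preserved within
    each group. Reordering this way raises row-buffer hit rate and thus
    lowers average access latency."""
    hits = [r for r in requests if r["row"] == open_row]
    misses = [r for r in requests if r["row"] != open_row]
    return hits + misses

reqs = [{"core": 0, "row": 7}, {"core": 1, "row": 3},
        {"core": 0, "row": 3}, {"core": 1, "row": 7}]
order = schedule_row_hits_first(reqs, open_row=3)
print([r["row"] for r in order])  # [3, 3, 7, 7]
```

In arrival order the bank would switch rows three times (7, 3, 3, 7 needs activations at 7, 3, and 7); the scheduled order needs only one switch.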
Grid Harvest Service (GHS)
Rapid advancement of communication technology has changed the landscape of computing. New models of computing such as business-on-demand, Web services, peer-to-peer networks, Grid computing, and Cloud computing have emerged to harness distributed computing and network resources and to provide powerful services. In such computing platforms, resources are shared and are likely to be remote, and thereby out of the user's control. Consequently, resource availability to each user varies largely from time to time due to resource sharing, system configuration changes, potential software or hardware failures, and other factors beyond user control. This non-deterministic nature of resource availability raises an outstanding challenge for the newly emerged models of computing: how to support Quality of Service (QoS) to meet a user's demand? To address this challenge, we propose and develop Grid Harvest Service (GHS), a software solution based on advanced performance modeling, resource management and monitoring, and task allocation and scheduling techniques for system support of QoS in shared network environments.
We develop and build different models to evaluate the impact of various resource availabilities and of parallel processing on application performance: for example, the effect of local jobs on a remote application's performance, the effect of resource reservation on local jobs' performance, and the impact of system failures on an application's execution time. To support the application of these novel performance models in practice, we implement effective and adaptive performance measurement mechanisms to dynamically monitor resource and application QoS. Based on resource sharing policies and application perspectives of QoS, we design and develop different task allocation strategies and scheduling algorithms to optimize application completion time or failure probability. This prototype software system is designed to provide system support of QoS in the most challenging distributed computing environment, Grid computing. GHS is incorporated into current Grid middleware to enable automated task submission, task execution, and data transfer across heterogeneous administration domains, and it provides end users with QoS-guaranteed service automatically based on the underlying QoS policies. The source code of GHS 1.1 is available online at the GHS Web site; GHS 1.1 considers only performance as the measure of QoS. A companion system developed with GHS is the Network Bandwidth Predictor (NBP), an online network performance forecasting system based on neural network technology. GHS plus NBP provides a full-function task scheduling system for distributed computing. In our current research, we are extending QoS to include reliability and extending GHS to support fault-tolerance-aware task scheduling.
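A toy sketch of availability-aware scheduling in the spirit of GHS (the performance model here, speed times measured availability, is a deliberate simplification and an assumption): each task goes to the machine with the earliest expected finish time under current sharing conditions.

```python
def availability_aware_schedule(tasks, machines):
    """Assign each task (a work amount) to the machine with the earliest
    expected finish time, where a machine shared with local jobs delivers
    only a fraction of its speed. Returns the plan and the makespan."""
    ready = {m["name"]: 0.0 for m in machines}   # time each machine frees up
    plan = []
    for work in sorted(tasks, reverse=True):     # place largest tasks first
        best = min(machines,
                   key=lambda m: ready[m["name"]] + work / (m["speed"] * m["avail"]))
        ready[best["name"]] += work / (best["speed"] * best["avail"])
        plan.append((work, best["name"]))
    return plan, max(ready.values())

machines = [{"name": "fast-shared", "speed": 4.0, "avail": 0.5},
            {"name": "slow-dedicated", "speed": 1.0, "avail": 1.0}]
plan, makespan = availability_aware_schedule([8.0, 4.0, 2.0], machines)
print(plan, makespan)  # the slow but dedicated machine still earns a task
```

Note how the nominally 4x-faster machine is only 2x faster once its 50% availability is priced in, which is why availability measurement and prediction sit at the heart of GHS.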
Workflow Research Project Workflow management is a new area of distributed computing. It shares many common characteristics with business workflow; however, with thousands of processes running in coordination in a widely distributed, shared network environment, the workflow of distributed computing is much more complex than conventional business workflow. Workflow supports task scheduling but is more than task scheduling. From the viewpoint of computing service, any user request is a service that can be decomposed into a series of known basic services. These basic services may have inherent control and data dependences and may need to be sent to different servers. These servers, in turn, may be shared by many users and may not be in the user's administration domain. Workflow management is designed to get the computing service done effectively. The Lattice QCD workflow research project is in the realm of scientific workflows. LQCD computations are very CPU demanding and produce large intermediate data files (on the order of gigabytes), and dedicated clusters are used exclusively for solving LQCD problems. Our goal is to develop a runtime environment for managing LQCD workflows (also called campaigns) that maximizes usage of these computing environments.
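The decomposition described above is naturally a dependency graph, and producing a valid execution order is the core mechanical step. A sketch using Python's standard topological sorter, with hypothetical LQCD-flavored service names (the actual campaign structure will differ):

```python
from graphlib import TopologicalSorter

# One user request decomposed into basic services with control/data
# dependencies; each maps to {service: set of prerequisite services}.
campaign = {
    "generate_gauge_config": set(),
    "compute_propagator": {"generate_gauge_config"},
    "contract_correlators": {"compute_propagator"},
    "archive_results": {"contract_correlators", "compute_propagator"},
}

# A valid execution order respects every dependency edge; independent
# services could be dispatched to different (possibly shared) servers.
order = list(TopologicalSorter(campaign).static_order())
print(order)
```

A runtime for campaigns does much more (retries, data staging, server selection), but every scheduling decision starts from an ordering like this one.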
Dynamic Virtual Machine Project The Dynamic Virtual Machine (DVM) system is a prototype middleware that provides applications with a secure, stable, and specialized computing environment in cyberspace. It encapsulates computational resources in a secure, isolated, and customized virtual machine environment, and enables transparent service mobility and service provisioning. In this research, a computing environment is modeled as a DVM, an abstract virtual machine, and is incarnated automatically on various virtualization platforms. To migrate a virtual machine, the DVM system needs to collect the runtime state of the VM; communication must be kept alive, and messages in transit must be delivered in order. The DVM project is a continuation and further development of our previous projects. We use our previous work on the High Performance Computing Mobility (HPCM) middleware to migrate the running state of the VM. HPCM was originally designed to collect and transfer runtime state and to migrate a running process in heterogeneous environments; it has been implemented and tested with various applications. We have developed a set of communication protocols that prevent message loss and preserve message order for both point-to-point and group communication. The point-to-point communication protocols are implemented in the Scalable Network of Workstations (SNOW) project and tested on the PVM platform; the group communication protocols are implemented in MPI-mitten (Enabling Migration Technology in MPI), a user-level communication library, and tested on the MPI-2 platform.
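The in-order delivery requirement can be sketched with sequence numbers and a reorder buffer; the details below are an illustration of the general technique, not the actual SNOW or MPI-mitten protocol:

```python
class InOrderChannel:
    """Messages carry sequence numbers; anything arriving out of order,
    e.g. while the receiving VM is mid-migration, is buffered until the
    gap is filled, so nothing is lost or delivered out of order."""
    def __init__(self):
        self.expected = 0     # next sequence number to deliver
        self.buffer = {}      # out-of-order arrivals, keyed by sequence
        self.delivered = []

    def receive(self, seq, payload):
        self.buffer[seq] = payload
        while self.expected in self.buffer:      # drain in sequence
            self.delivered.append(self.buffer.pop(self.expected))
            self.expected += 1

ch = InOrderChannel()
for seq, msg in [(0, "a"), (2, "c"), (1, "b"), (3, "d")]:  # "c" arrives early
    ch.receive(seq, msg)
print(ch.delivered)  # ['a', 'b', 'c', 'd']
```

Group communication adds the harder problem of agreeing on one order across all receivers, which is what the MPI-mitten protocols address.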
Pervasive Computing This project develops human-centered computing environments that combine processors and sensors with network technologies and intelligent software.
DistDLB: Dynamic Load Balancing of Scientific Applications on Parallel and Distributed Systems Large-scale simulation is an important method in scientific and engineering research, as a complement to theory and experiment, and has been widely used to study complex phenomena in many disciplines. Many large-scale applications are adaptive in that their computational load varies throughout the execution, causing uneven distribution of the workload at run time. Dynamic load balancing (DLB) of adaptive applications involves efficiently repartitioning the application and then migrating excess workload from overloaded processors to underloaded processors during execution.
Different applications have different adaptive characteristics, which may leave existing DLB schemes and tools ill-suited for them. We have been working with scientists from various disciplines with the objective of providing efficient and effective load balancing techniques for their applications, including the cosmology application ENZO and the molecular dynamics application GROMACS.
Most existing partition schemes target homogeneous parallel systems and are not appropriate for large-scale applications running on heterogeneous distributed systems. Furthermore, the cost entailed by workload migration may consume orders of magnitude more time than the actual partitioning when the excess workload is transferred across geographically distributed machines. In particular, workload migration must take into account that the performance of the wide area network connecting the geographically distributed sites is dynamic and changes throughout execution, in addition to the resource heterogeneity. To address these problems, we have designed novel data partition and migration techniques applicable to a range of large-scale adaptive applications executed in heterogeneous distributed computing environments.
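Two of the ingredients above can be sketched directly: capacity-proportional partitioning for heterogeneous nodes, and a run-time check of whether migrating excess workload over a dynamic wide-area link is worth its transfer cost. Both models are simplified assumptions, not the designed techniques themselves:

```python
def partition_by_capacity(total_work, nodes):
    """Heterogeneity-aware split: give each node work in proportion to its
    effective speed, so faster machines get more and all finish together."""
    total_speed = sum(n["speed"] for n in nodes)
    return {n["name"]: total_work * n["speed"] / total_speed for n in nodes}

def worth_migrating(time_saved, bytes_to_move, wan_bandwidth):
    """Migrate excess workload only if the compute time saved exceeds the
    wide-area transfer time; since WAN bandwidth changes during execution,
    this is re-evaluated with a fresh bandwidth estimate each time."""
    return time_saved > bytes_to_move / wan_bandwidth

nodes = [{"name": "cluster_a", "speed": 3.0},
         {"name": "cluster_b", "speed": 1.0}]
print(partition_by_capacity(1000.0, nodes))  # {'cluster_a': 750.0, 'cluster_b': 250.0}
print(worth_migrating(60.0, 10e9, 1e9))      # True: save 60 s, pay 10 s of transfer
```

The second check is where the two concerns in the paragraph meet: a rebalancing that looks profitable under a stale bandwidth estimate can easily cost more than the imbalance it removes.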