Empower Data-Intensive Computing:

the integrated data management approach

(NSF CNS-1526887)

Abstract:

From the system point of view, there are two types of data: observational data, collected by electronic devices such as sensors, monitors, and cameras; and simulation data, generated by computation. In general, the latter is used in traditional scientific high-performance computing (HPC) and requires strong consistency for correctness. The former is common in newly emerged big data applications and does not require strong consistency. This difference in consistency has led to two kinds of file systems: data-intensive distributed file systems, represented by the MapReduce-based Google File System (GFS) and the Hadoop Distributed File System (HDFS) developed at Yahoo; and computing-intensive file systems, represented by high-performance parallel file systems (PFS), such as the IBM General Parallel File System (GPFS). These two kinds of file systems are designed with different philosophies and for different applications, and they do not talk to each other. They form two separate ecosystems and are used by different communities. However, as data-intensive applications become increasingly ubiquitous, understanding the huge amounts of collected data starts to require powerful computation; at the same time, advanced large-scale computations also demand the ability to handle huge amounts of data, not only because HPC generates more data than before, but also because advanced data-oriented technologies such as visualization and data mining have become part of general scientific computing. Therefore, HPC and big data requirements are merging, and an integrated solution for data-intensive HPC and high-performance data analytics (HPDA) is needed. In this research, we propose an Integrated Data Access System (IDAS) to bridge the data management gap.

Personnel:

Principal Investigator:

Graduate Students:

Undergraduate Students:


IDAS Framework

Main Contributions of IDAS:

The IDAS framework has been implemented and tested. In particular, the direction from the MapReduce environment to parallel file systems (PFS) is fully developed and has been deployed in the NASA environment for NASA applications under the NASA Super Cloud project. The corresponding software system, named PortHadoop, was developed and tested on the NSF Chameleon cloud computing facility. PortHadoop was reported by HPCwire and the NSF TACC supercomputing center as a successful research project.
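As a concrete illustration of this direction, the sketch below (in Java, using the standard Hadoop MapReduce API) shows how a word-count style job could read its input directly from a PFS path instead of first copying the data into HDFS. The pfs:// scheme, the edu.example.porthadoop.PFSFileSystem plugin class, and the path names are hypothetical placeholders rather than the actual PortHadoop interfaces; they only mark where a PortHadoop-style FileSystem plugin would be registered.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical sketch: a standard MapReduce job whose input lives on a
// parallel file system instead of HDFS. The "pfs" scheme and the plugin
// class name are placeholders for a PortHadoop-style FileSystem plugin.
public class PfsWordCount {

  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String tok : value.toString().split("\\s+")) {
        if (!tok.isEmpty()) { word.set(tok); ctx.write(word, ONE); }
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : vals) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Register the (hypothetical) PFS-backed FileSystem implementation so that
    // paths with the pfs:// scheme resolve against the parallel file system.
    conf.set("fs.pfs.impl", "edu.example.porthadoop.PFSFileSystem");

    Job job = Job.getInstance(conf, "wordcount-over-pfs");
    job.setJarByClass(PfsWordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input is read directly from the PFS; output is written back to HDFS.
    FileInputFormat.addInputPath(job, new Path("pfs://pfs-head-node/observations/"));
    FileOutputFormat.setOutputPath(job, new Path("hdfs:///user/idas/wordcount-out"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

In this layout the mapper and reducer are unchanged from an ordinary Hadoop job; only the input path and the registered FileSystem plugin differ, which is the point of a data-access-level integration.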



IRIS Framework

Main Contributions of IRIS:

Distributed data processing environments have advanced quickly over the last several years, and supporting only the integration of Hadoop environments with HPC is no longer sufficient. We are currently extending the IDAS idea to support more general integrations and the convergence of HPC and big data/cloud environments at the data access level. We are developing a new framework, named IRIS, as this extension and continuation.
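To make the data-access-level integration concrete, the sketch below outlines one possible shape of such a unified access layer: applications open data by URI, and the layer dispatches to an HDFS backend (through the standard Hadoop FileSystem API) or to a POSIX backend for a locally mounted parallel file system (through java.nio). The DataStore interface, the class names, and the scheme-based dispatch rule are illustrative assumptions, not the actual IRIS API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch of a unified data-access layer: callers open data by URI
// and the layer picks a backend, so the same application code can read from
// HDFS or from a POSIX-mounted parallel file system. All names are hypothetical.
interface DataStore {
  InputStream open(URI uri) throws IOException;
}

// HDFS-backed implementation via the standard Hadoop FileSystem API.
class HdfsStore implements DataStore {
  private final Configuration conf = new Configuration();
  @Override
  public InputStream open(URI uri) throws IOException {
    FileSystem fs = FileSystem.get(uri, conf);
    return fs.open(new Path(uri));
  }
}

// POSIX-backed implementation for a parallel file system mounted locally
// (for example, a Lustre or GPFS mount point).
class PosixStore implements DataStore {
  @Override
  public InputStream open(URI uri) throws IOException {
    return Files.newInputStream(Paths.get(uri.getPath()));
  }
}

public class UnifiedAccess {
  // Dispatch on the URI scheme; a full integration layer would also handle
  // caching, prefetching, and consistency across the two ecosystems.
  static DataStore resolve(URI uri) {
    return "hdfs".equals(uri.getScheme()) ? new HdfsStore() : new PosixStore();
  }

  public static void main(String[] args) throws IOException {
    URI uri = URI.create(args.length > 0 ? args[0] : "file:///tmp/sample.dat");
    try (InputStream in = resolve(uri).open(uri)) {
      System.out.println("first byte: " + in.read());
    }
  }
}
```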


Publications (some early publications are listed for completeness):

Software:

The PortHadoop library can be found here.

The IRIS library will be available soon.

Sponsor:

National Science Foundation