Empower Data-Intensive Computing: the integrated data management approach



Introduction

From the computer system point of view, there are two types of digital data: observational data, collected by electronic devices such as sensors, monitors, and cameras, as well as text; and simulation data, generated by computing. The former represents newly emerged, data-driven Internet applications, such as social media and data analytics; the latter represents conventional computing-driven applications, such as climate modeling and computational fluid dynamics. In general, the latter requires strong consistency for correctness and the former does not. This difference in consistency requirements leads to two kinds of file systems: data-intensive distributed file systems, represented by the MapReduce-based Hadoop Distributed File System (HDFS); and computing-intensive file systems, represented by high-performance parallel file systems (PFS) such as the IBM General Parallel File System (GPFS). These two kinds of file systems are designed with different philosophies, for different applications, and do not talk to each other. Yet understanding huge amounts of collected data depends on powerful computation, and large-scale computation requires the management of large data. Big data applications therefore demand an integrated solution. The integrated data access system (IDAS) developed under this research is designed to bridge this data management gap.

Motivation

Consistent with the CAP theorem in distributed system design, IDAS is designed not as a new standalone system but as a software layer that provides an integrated interface for cross-platform data access, from HDFS to PFS or from PFS to HDFS, for reads and writes, effectively and interchangeably, without changing users' applications.
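The idea of a thin layer that routes the same read and write calls to either file system can be illustrated with a minimal sketch. This is not the IDAS implementation or its API: the class names (`StorageBackend`, `InMemoryBackend`, `IntegratedAccess`), the `hdfs://` and `pfs://` URI schemes, and the in-memory stand-ins for the real file systems are all assumptions made for illustration.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from urllib.parse import urlparse


class StorageBackend(ABC):
    """Hypothetical minimal contract every backend must satisfy."""

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...


class InMemoryBackend(StorageBackend):
    """Stand-in for a real file system (HDFS or PFS); keeps bytes in a dict."""

    def __init__(self) -> None:
        self._files: dict[str, bytes] = {}

    def read(self, path: str) -> bytes:
        return self._files[path]

    def write(self, path: str, data: bytes) -> None:
        self._files[path] = data


class IntegratedAccess:
    """Routes each request to the backend registered for the URI scheme."""

    def __init__(self) -> None:
        self._backends: dict[str, StorageBackend] = {}

    def register(self, scheme: str, backend: StorageBackend) -> None:
        self._backends[scheme] = backend

    def _resolve(self, uri: str) -> tuple[StorageBackend, str]:
        parsed = urlparse(uri)
        if parsed.scheme not in self._backends:
            raise ValueError(f"no backend registered for scheme {parsed.scheme!r}")
        return self._backends[parsed.scheme], parsed.path

    def read(self, uri: str) -> bytes:
        backend, path = self._resolve(uri)
        return backend.read(path)

    def write(self, uri: str, data: bytes) -> None:
        backend, path = self._resolve(uri)
        backend.write(path, data)

    def copy(self, src_uri: str, dst_uri: str) -> None:
        # Cross-platform transfer: source and destination may be different backends.
        self.write(dst_uri, self.read(src_uri))


# Usage: the application uses one interface regardless of where the data lives.
layer = IntegratedAccess()
layer.register("pfs", InMemoryBackend())   # assumed scheme name for the PFS side
layer.register("hdfs", InMemoryBackend())  # assumed scheme name for the HDFS side
layer.write("pfs:///crm/run1.dat", b"simulation output")
layer.copy("pfs:///crm/run1.dat", "hdfs:///staging/run1.dat")
```

In this sketch the application code never names a specific file system client; only the URI changes, which mirrors the stated goal of leaving users' applications unchanged. A real layer would replace the in-memory stand-ins with actual HDFS and PFS clients.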

Application Example

A cloud-resolving model (CRM) is an atmospheric numerical model that can resolve clouds and cloud systems at very high spatial resolution. We are developing the Super Cloud Library (SCL), which supports CRM database management (I/O control and compression), distribution, visualization, subsetting, and evaluation. SCL will be built on the NCCS Discover system, which directly stores various CRM simulations in PFS, including those of the NASA-Unified Weather Research and Forecasting (NU-WRF) and Goddard Cumulus Ensemble (GCE) models. The SCL architecture is built upon a Hadoop/Spark framework. The Hadoop Distributed File System (HDFS) is a stable, distributed, scalable, and portable file system, which is used to store earth science data for data analytics at NCCS. We developed two tools to accelerate the analysis procedure. First, the Dynamic Hadoop Reader enables Hadoop-based applications to transparently access and process remote PFS-resident data. Second, we proposed a novel strategy that uses RHadoop and SparkR to enable diagnosis, subsetting, and visualization in the Hadoop and Spark frameworks, respectively.

Publications

  • X. Yang, S. Liu, K. Feng, S. Zhou, and X.-H. Sun, "Visualization and Adaptive Subsetting of Earth Science Data in HDFS - A Novel Data Analysis Strategy with Hadoop and Spark," in Proc. of the 6th IEEE International Conference on Big Data and Cloud Computing (BDCloud 2016), Atlanta, GA, Oct. 2016 (accepted)

  • X. Yang, N. Liu, B. Feng, X.-H. Sun and S. Zhou, "PortHadoop: Support Direct HPC Data Processing in Hadoop," in Proc. of IEEE International Conference on Big Data (IEEE BigData 2015). Santa Clara, CA, USA, Oct. 2015. (acceptance rate: 17%)

  • S. Zhou, X. Yang, X. Li, T. Matsui, S. Liu, X.-H. Sun and W. Tao, "A Hadoop-Based Visualization and Diagnosis Framework for Earth Science Data," in Proc. of Big Data in the Geosciences Workshop, in conjunction with IEEE International Conference on Big Data (IEEE BigData 2015) (short paper). Santa Clara, CA, USA, Oct. 2015.

  • X. Yang, Y. Yin, H. Jin, and X.-H. Sun, "SCALER: Scalable Parallel File Write in HDFS," in Proc. of IEEE International Conference on Cluster Computing 2014 (Cluster'14), Madrid, Spain, Sept. 2014.


Contact

    Xian-He Sun
    Department of Computer Science
    Illinois Institute of Technology
    Chicago, IL 60616
    sun@iit.edu
