Hermes

Extending the HDF Library to Support
Intelligent I/O Buffering for Deep Memory and Storage Hierarchy Systems
(NSF OCI-1835764)

Abstract

Modern high-performance computing (HPC) applications generate massive amounts of data. However, the performance of disk-based storage systems has improved much more slowly than that of memory, creating a significant Input/Output (I/O) performance gap. To reduce this gap, storage subsystems are undergoing extensive changes, adopting new technologies and adding more layers to the memory/storage hierarchy. With a deeper memory hierarchy, the complexity of data movement increases significantly, making it harder to exploit the potential of the deep memory and storage hierarchy (DMSH) design. As we move towards the exascale era, the I/O bottleneck is a performance bottleneck that the HPC community must solve. DMSHs with multiple memory/storage layers offer a feasible solution but are very complex to use effectively. Ideally, the presence of multiple layers of storage should be transparent to applications, without sacrificing I/O performance. There is a need to enhance and extend current software systems to support data access and movement transparently and effectively under DMSHs.

Hierarchical Data Format (HDF) technologies are a set of current I/O solutions that address the problems of organizing, accessing, analyzing, and preserving data. The HDF5 library is widely popular within the scientific community. Among the high-level I/O libraries used in DOE labs, HDF5 is the undeniable leader with 99% of the share. HDF5 addresses the I/O bottleneck by hiding the complexity of performing coordinated I/O to single, shared files, and by encapsulating general-purpose optimizations. While HDF technologies, like other existing I/O middleware, are not designed to support DMSHs, their wide popularity and middleware nature make HDF5 an ideal candidate to enable, manage, and supervise I/O buffering under DMSHs.

This project proposes the development of Hermes, a heterogeneity-aware, multi-tiered, dynamic, and distributed I/O buffering system that will significantly accelerate I/O performance.

This project proposes to extend HDF technologies with the Hermes design. Both Hermes and the enhancement of HDF5 are new. The deliverables of this research include an enhanced HDF5 library, a set of extended HDF technologies, and a group of general I/O buffering and memory system optimization mechanisms and methods. We believe that the combination of DMSH I/O buffering and HDF technologies is a practical, reachable solution that can efficiently support scientific discovery. Hermes will advance HDF5 core technology by developing new buffering algorithms and mechanisms to support:

  • Vertical and Horizontal Buffering in DMSHs: vertical means moving data to/from different levels of the hierarchy locally, and horizontal means spreading/gathering data across remote compute nodes.
  • Selective Buffering via HDF5: selective means that a given memory layer, e.g. NVMe, is used only for selected data.
  • Dynamic Buffering via Online System Profiling: the buffering schema can be changed dynamically based on messaging traffic.
  • Adaptive Buffering via Reinforcement Learning: by learning the application's access pattern, we can adapt prefetching algorithms and cache replacement policies at runtime.

The development of Hermes will be translated into high-quality, dependable software and will be released with the core HDF5 library.
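To make the horizontal direction in the list above more concrete, the following is a minimal sketch of spreading one buffered request across several nodes. It is not the Hermes implementation: the Chunk struct, the spread_horizontally function, and the round-robin policy are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one piece of a buffered request.
struct Chunk {
    std::size_t offset;  // byte offset within the original request
    std::size_t length;  // number of bytes in this chunk
    int node;            // index of the node chosen to hold the chunk
};

// Split `total_bytes` into chunks of at most `chunk_bytes` and assign the
// chunks to `num_nodes` nodes in round-robin order (an assumed policy).
std::vector<Chunk> spread_horizontally(std::size_t total_bytes,
                                       std::size_t chunk_bytes,
                                       int num_nodes) {
    assert(chunk_bytes > 0 && num_nodes > 0);
    std::vector<Chunk> chunks;
    int node = 0;
    for (std::size_t off = 0; off < total_bytes; off += chunk_bytes) {
        // The last chunk may be shorter than chunk_bytes.
        std::size_t len = std::min(chunk_bytes, total_bytes - off);
        chunks.push_back({off, len, node});
        node = (node + 1) % num_nodes;
    }
    return chunks;
}
```

Gathering is the reverse step: the chunks are read back from their nodes and reassembled by offset. A real system would presumably choose nodes based on load and remaining capacity rather than plain round-robin.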

Personnel

Principal Investigators:

Dr. Xian-He Sun
PI, Illinois Institute of Technology
(November 2018-present)

Elena Pourmal
Co-PI, The HDF Group
(November 2018-present)

Graduate Students:

Hariharan Devarajan
PhD student, SCS Lab, Illinois Institute of Technology
(November 2018-present)

Neeraj Rajesh
PhD student, SCS Lab, Illinois Institute of Technology
(May 2019-present) 

Senior Personnel:

Dr. Anthony Kougkas
Lead Researcher, Illinois Institute of Technology
(November 2018-present)

Gerd Heber
Lead Engineer, The HDF Group
(November 2018-present) 

Undergraduate Students:

Keith Bateman
SCS Lab, Illinois Institute of Technology 
(May 2019-present)

Hugo Trivino
SCS Lab, Illinois Institute of Technology
(May 2019-present)

Hermes Overview

Today's multi-tiered environments suffer from:

Complex data placement among the tiers of a deep memory and storage hierarchy

  • Lack of automated data movement between tiers; this is now left to the users.
  • Lack of intelligent data placement in the DMSH.

Independent management of each tier of the DMSH

  • Lack of expertise from the user.
  • Lack of existing software for managing tiers of heterogeneous buffers.
  • Lack of native buffering support in HDF5.

Deep memory and storage hierarchy (DMSH) systems require:

  • Efficient and transparent data movement through the hierarchy,
  • New data placement algorithms,
  • Effective memory and metadata management,
  • An efficient communication fabric.

In this project, we envision a platform that:

  1. Is transparent to the application and can easily plug into existing workflows.
  2. Provides an extended API that enables active buffering for data-intensive operations.
  3. Facilitates efficient access to all tiers of the DMSH, both vertically and horizontally.
  4. Is feature-rich yet simple and lightweight.

A new, multi-tiered, distributed buffering platform that:

  • Enables, manages, and supervises I/O operations in the Deep Memory and Storage Hierarchy (DMSH).
  • Offers selective and dynamic layered data placement.
  • Is modular, extensible, and performance-oriented.
  • Supports a wide variety of applications (scientific, Big Data, etc.).

Hermes Architecture

Hermes machine model

A large amount of RAM, local NVMe and/or SSD devices, shared burst buffers, and a remote disk-based parallel file system (PFS).

Hierarchy based on:

  • Access latency
  • Data throughput
  • Capacity

Two data paths:

  • Vertical: within a node
  • Horizontal: across nodes
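The ordering criteria in the machine model above can be sketched in code. The example below is only an illustration under stated assumptions: the Tier struct, the function names, and every number are placeholders invented for the sketch, not measurements and not part of Hermes.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical description of one storage tier; field names are assumptions.
struct Tier {
    std::string name;
    double latency_us;       // access latency, lower is better
    double throughput_gbps;  // data throughput, higher is better
    double capacity_gb;      // capacity, used only as a final tie-breaker
};

// Order tiers from the top of the hierarchy (fastest) to the bottom.
std::vector<Tier> order_hierarchy(std::vector<Tier> tiers) {
    std::sort(tiers.begin(), tiers.end(), [](const Tier& a, const Tier& b) {
        // Lower latency first, then higher throughput, then smaller capacity.
        return std::make_tuple(a.latency_us, -a.throughput_gbps, a.capacity_gb) <
               std::make_tuple(b.latency_us, -b.throughput_gbps, b.capacity_gb);
    });
    return tiers;
}

// Return the fastest tier of an already ordered, non-empty hierarchy.
const Tier& top_tier(const std::vector<Tier>& ordered) {
    assert(!ordered.empty());
    return ordered.front();
}

// Placeholder numbers chosen only to make the ordering visible; they are
// not measurements of any real system.
std::vector<Tier> example_machine() {
    return order_hierarchy({
        {"PFS", 10000.0, 1.0, 1000000.0},
        {"RAM", 0.1, 100.0, 64.0},
        {"BurstBuffer", 500.0, 5.0, 10000.0},
        {"NVMe", 20.0, 10.0, 1000.0},
    });
}
```

With these placeholder values, example_machine() yields the order RAM, NVMe, BurstBuffer, PFS, which matches the order in which the machine model lists the layers. The vertical path walks this list within a node; the horizontal path would apply the same idea to tiers on other nodes.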

Hermes Node Design

  • Dedicated core for Hermes
  • RDMA-capable communication
  • Can also be deployed in the I/O Forwarding Layer (I/O FL)
  • Node Manager:
      • Dedicated multithreaded core per node
      • Metadata Manager (MDM)
      • Data Organizer
      • Messaging Service
      • Memory management
      • Prefetcher
      • Cache manager

Hermes Design

  • Middleware library written in C++: links with applications (i.e., re-compile or LD_PRELOAD) and wraps around I/O calls.
  • Modular, extensible, performance-oriented.
  • Will support POSIX, HDF5, and MPI-IO.
  • Hinting mechanism to pass the user’s operations.

Hermes goals consist of:

  1. being application- and system-aware
  2. maximizing productivity
  3. increasing resource utilization
  4. abstracting data movement
  5. maximizing performance
  6. supporting a wide range of scientific applications and domains


Design Implications:

Evaluation Results

Hermes Library Evaluation

RAM Management

1 million fwrite() calls of various sizes; measured memory operations per second (ops/sec)

Metadata Management

1 million metadata operations; measured MDM throughput (ops/sec)

Communication

1 million queue operations; measured messaging rate (msg/sec)
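The page does not describe how these microbenchmarks were implemented. As a rough sketch of the kind of measurement involved, the following times a batch of plain stdio fwrite() calls and reports operations per second. The function name is hypothetical, and the code measures the C library writing to a temporary file, not Hermes.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Issue `count` fwrite() calls of `size` bytes each to a temporary file and
// return the observed rate in operations per second (0.0 on failure).
double measure_fwrite_ops_per_sec(std::size_t count, std::size_t size) {
    assert(count > 0);
    std::vector<char> payload(size, 'x');
    std::FILE* f = std::tmpfile();
    if (f == nullptr) return 0.0;

    auto start = std::chrono::steady_clock::now();
    std::size_t completed = 0;
    for (std::size_t i = 0; i < count; ++i) {
        // Count only calls that wrote the full payload.
        if (std::fwrite(payload.data(), 1, size, f) == size) ++completed;
    }
    std::fflush(f);  // include the final flush in the timed region
    auto stop = std::chrono::steady_clock::now();
    std::fclose(f);

    double seconds = std::chrono::duration<double>(stop - start).count();
    return seconds > 0.0 ? static_cast<double>(completed) / seconds : 0.0;
}
```

Calling measure_fwrite_ops_per_sec(1000000, 4096) would mirror the "1 million fwrite()" setup for one request size; repeating the call for several sizes gives one data point per size.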

Workload Evaluation

Alternating Compute-I/O

8x higher write performance on average

Repetitive Read Operations

11x higher read performance for repetitive patterns

VPIC

5x higher write performance on average

HACC

7.5x higher read performance for repetitive patterns

Hermes Contributions

FAQ

That is true. We suggest using profiling tools beforehand to learn about the application’s behavior and tune Hermes. The default policy works great.

As of now, applications link to Hermes (re-compile or dynamic linking). We envision a system scheduler that also incorporates buffering resources.

Hermes’ Application Orchestrator was designed for multi-tenant environments. This work is described in Vidya: Performing Code-Block I/O Characterization for Data Access Optimization.

It can be severe, but in scenarios where there is some computation between I/O phases, it can work nicely to our advantage.

In our evaluation, for 1 million user files, the metadata created amounted to 1.1 GB.

Hermes’ System Profiler provides the current status of the system (e.g., remaining capacity), and the Data Placement Engine (DPE) is aware of this before it places data in the DMSH.
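A minimal sketch of this interplay, assuming a hypothetical TierStatus record that stands in for what a profiler might report and a hypothetical place function that stands in for a placement decision. Neither name comes from Hermes, and the first-fit policy is an assumption chosen for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical snapshot of one tier, as a profiler might report it.
struct TierStatus {
    std::string name;
    std::size_t remaining_bytes;  // capacity still available in this tier
};

// First-fit placement: `tiers` is ordered fastest first. Reserve `bytes` in
// the first tier with enough remaining capacity and return its index, or
// return -1 if no tier can hold the request.
int place(std::vector<TierStatus>& tiers, std::size_t bytes) {
    assert(bytes > 0);
    for (std::size_t i = 0; i < tiers.size(); ++i) {
        if (tiers[i].remaining_bytes >= bytes) {
            tiers[i].remaining_bytes -= bytes;  // keep the snapshot current
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

The point of the sketch is only that the placement step consults remaining capacity before choosing a tier, so a request falls through to a lower tier when the faster ones are full.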

Horizontal data movement can get in the way of normal compute traffic. RDMA-capable machines can help. We also suggest using the “service class” of the InfiniBand network to apply priorities in the network.

This is configurable by the user and is a typical trade-off: giving more RAM to Hermes can lead to higher performance, while giving it no RAM means the layer is skipped.

Hermes captures existing I/O calls. Our own API is really simple, consisting of hermes::read(…, flags) and hermes::write(…,flags). The flag system implements active buffering semantics (currently only for the burst buffer nodes).

We expose a configuration_manager class, which is used to pass several of Hermes’ configuration parameters.

Publications

Sponsor

National Science Foundation
(NSF OCI-1835764)

Address

Stuart Building
Room 112i, Room 010
10 W. 31st Street
Chicago, Illinois 60616

Contacts

Email: scs-help@cs.iit.edu
Phone: +1 312 567 6885