RAPS: Recovery Aware Parallel computing Systems

With rapid advances in processing technology, along with emerging multi-core processors and specialized co-processors, parallel processing is permeating almost every aspect of our lives, from high-end computing to commodity deployment. Production systems in the next few years are expected to contain hundreds of thousands of processors, each containing dozens of cores. Because of their ever-growing scale and complexity, these systems often fail in unpredictable ways. Studies have shown that in production systems, failure rates range from 20 to more than 1000 per year and, depending on the root cause of the problem, the system-level mean time to repair (MTTR) ranges from a couple of hours for failures caused by human errors to nearly 100 hours for failures due to hardware problems. Years of intense research have focused on pre-failure prediction and tolerance, that is, predicting failures and taking precautionary actions before they occur. Nevertheless, despite this progress, unexpected failures frequently occur in practice, especially in modern parallel systems of unprecedented size and complexity. Because failures are inevitable, relying on pre-failure prediction and tolerance alone is insufficient for fault tolerance. Just as failures need to be carefully avoided and tolerated, post-failure diagnosis and recovery (i.e., the procedures taken after a failure) is of equal importance and has a profound impact on almost every aspect of parallel computing.

The goal of this research project is to develop RAPS, a Recovery Aware Parallel computing System for post-failure diagnosis and recovery. The research focuses on how to quickly and effectively resume parallel computing after a failure has occurred. The ultimate goal is to seamlessly integrate post-failure diagnosis and recovery with pre-failure prediction and tolerance into a compound fault management solution for parallel computing. The approach consists of (1) development of new diagnosis mechanisms for fast failure detection and root cause analysis, (2) development of system-wide orchestration for recovery coordination, (3) design of new recovery techniques for quick restoration of parallel applications, and (4) a comprehensive evaluation. The results of this project can significantly improve the productivity of parallel systems.

Collaborators:
  • (ANL) Narayan Desai, Daniel Buettner, Rajeev Thakur, Susan Coghlan, Rinku Gupta and Pete Beckman
  • (ORNL) Byung-Hoon Park and Al Geist
  • (SDSC) John White and Eva Hocks


Publications:
  • Y. Li and Z. Lan, "FREM: A Fast Restart Mechanism for General Checkpoint/Restart", to appear in the IEEE Trans. on Computers, 2010.
  • Z. Zheng, Z. Lan, R. Gupta, S. Coghlan, and P. Beckman, "A Practical Failure Prediction with Location and Lead Time for Blue Gene/P", Proc. of the 1st Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS), in conjunction with DSN'10, 2010.
  • Z. Lan, J. Gu, Z. Zheng, R. Thakur, and S. Coghlan, "A Study of Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems", Journal of Parallel and Distributed Computing (JPDC), 2010. [PDF]
  • W. Tang, N. Desai, D. Buettner, and Z. Lan, "Analyzing and Adjusting User Runtime Estimates to Improve Job Scheduling on Blue Gene/P", Proc. of IPDPS'10, 2010. [Best Paper Award]
  • Z. Lan, Z. Zheng, and Y. Li, "Toward Automated Anomaly Identification in Large-Scale Systems", IEEE Trans. on Parallel and Distributed Systems, Feb., 2010.
  • Z. Zheng and Z. Lan, "Reliability-Aware Scalability Models for High Performance Computing", Proc. of IEEE Cluster'09, 2009.
  • W. Tang, Z. Lan, N. Desai, and D. Buettner, "Fault-Aware Utility-Based Job Scheduling on Blue Gene/P Systems", Proc. of IEEE Cluster'09, 2009.
  • Y. Li, Z. Lan, P. Gujrati, and X. Sun, "Fault-Aware Runtime Strategies for High Performance Computing", IEEE Trans. on Parallel and Distributed Systems, vol. 20(4), pp. 460-473, 2009.
  • Z. Zheng, Z. Lan, B-H. Park, and A. Geist, "System Log Pre-processing to Improve Failure Prediction", Proc. of DSN'09, 2009. [PDF]
  • Z. Zheng, R. Gupta, Z. Lan, and S. Coghlan, "FTB-enabled Failure Prediction for Blue Gene/P Systems", Proc. of SC'09 (research poster), 2009.
  • H. Jin, X. Sun, Z. Zheng, Z. Lan and B. Xie, "Performance under Failures of DAG-based Parallel Computing", Proc. of CCGrid'09, 2009. [PDF]
  • B-H. Park, Z. Zheng, Z. Lan, and A. Geist, "Analyzing Failure Events on ORNL's Cray XT4", Proc. of SC'08 (research poster), 2008.
  • Y. Li and Z. Lan, "A Fast Recovery Mechanism for Checkpointing in Networked Environments", Proc. of DSN'08, 2008. [PDF]
  • Z. Zheng, Y. Li, and Z. Lan, "Anomaly Localization in Large-scale Clusters", Proc. of IEEE Cluster'07, 2007. [PDF]


Contact:
    Dr. Zhiling Lan (lan AT iit DOT edu)

    This work is supported by the US National Science Foundation.