RAPS Project

RAPS: Recovery Aware Parallel computing Systems

With rapid advances in processing technology, along with emerging multi-core processors and specialized co-processors, parallel processing is permeating almost every aspect of our lives, from high-end computing to commodity deployment. Production systems in the next few years are expected to contain hundreds of thousands of processors, with each processor containing dozens of cores. As the scale and complexity of parallel systems keep growing, these systems often fail in unpredictable ways. Studies have shown that in production systems, failure rates range from 20 to more than 1000 per year, and, depending on the root cause of the problem, the system-level mean time to repair (MTTR) ranges from a couple of hours for failures caused by human errors to nearly 100 hours for failures due to hardware problems. Years of intense research have focused on pre-failure prediction and tolerance, that is, predicting failures and taking precautionary actions before they occur. Nevertheless, despite research progress on failure prediction, unexpected failures frequently occur in practice, especially in modern parallel systems of unprecedented size and complexity. Hence, because failures are inevitable, relying on pre-failure prediction and tolerance alone is insufficient for fault tolerance. Just as failures need to be carefully avoided and tolerated, post-failure diagnosis and recovery (i.e., the procedures taken after a failure) is of equal importance and has a profound impact on almost every aspect of parallel computing.

The goal of this research project is to develop RAPS, a Recovery Aware Parallel computing System for post-failure diagnosis and recovery. The research focuses on how to quickly and effectively resume parallel computing after a failure has occurred. The ultimate goal is to seamlessly integrate post-failure diagnosis and recovery with pre-failure prediction and tolerance as a compound fault management solution for parallel computing. The approach consists of (1) development of new diagnosis mechanisms for fast failure detection and root cause analysis, (2) development of system-wide orchestration for recovery coordination, (3) design of new recovery techniques for quick restoration of parallel applications, and (4) a comprehensive evaluation. The project also includes three integrated education activities: (1) recruiting and training graduate and undergraduate students, (2) enhancing the CS curriculum, and (3) providing outreach programs for underrepresented groups.

Collaborators:

(ANL) Narayan Desai, Daniel Buettner, Rajeev Thakur, Susan Coghlan, Rinku Gupta and Pete Beckman

(Sandia) Jim Brandt and Ann Gentile

(ORNL) Terry Jones, Byung-Hoon Park and Al Geist

Software Tools and Public Data Sets:

(Software) SysDP - an automated fault diagnosis and prognosis software toolkit for large-scale systems. [Link]

(Software) QSim - an event-driven job scheduling simulator for Cobalt. [Link]

(Software) CQSim - an extensible and scalable resource management and job scheduling simulator. [Link]

(Software) schedshow - an analysis and visualization tool for job scheduling. [Link]

(Data set) a 9-month RAS log collected from the 40-rack production Blue Gene/P system has been released and is stored at the USENIX Computer Failure Data Repository. [Link]

(Data set) a 9-month workload trace collected from the 40-rack production Blue Gene/P system has been released and is stored at the Parallel Workloads Archive. [Link]

Key Publications:

Z. Zheng, L. Yu, Z. Lan, and T. Jones, "3-Dimensional Root Cause Diagnosis via Co-Analysis", Proc. of ICAC, 2012.

L. Yu, Z. Zheng, Z. Lan, T. Jones, J. Brandt, and A. Gentile, "Filtering Log Data: Finding the Needles in the Haystack", Proc. of DSN, 2012.

Y. Li and Z. Lan, "FREM: A Fast Restart Mechanism for General Checkpoint/Restart", IEEE Trans. on Computers, 60(5), 639-652, 2010.

Z. Lan, Z. Zheng, and Y. Li, "Toward Automated Anomaly Identification in Large-Scale Systems", IEEE Trans. on Parallel and Distributed Systems, 21(2), 174-187, 2010.

Z. Zheng, Z. Lan, R. Gupta, S. Coghlan, and P. Beckman, "A Practical Failure Prediction with Location and Lead Time for Blue Gene/P", Proc. of the 1st Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS), in conjunction with DSN'10, 2010.

Z. Lan, J. Gu, Z. Zheng, R. Thakur, and S. Coghlan, "A Study of Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems", Journal of Parallel and Distributed Computing (JPDC), 70(6), 630-643, 2010.

W. Tang, N. Desai, D. Buettner, and Z. Lan, "Analyzing and Adjusting User Runtime Estimates to Improve Job Scheduling on Blue Gene/P", Proc. of IPDPS'10, 2010. [Best Paper Award]

W. Tang, Z. Lan, N. Desai, and D. Buettner, "Automatic and Coordinated Job Recovery for High Performance Computing", IEEE Workshop on Many-Task Computing on Grids and Supercomputers, 2010.

Y. Li, Z. Lan, P. Gujrati, and X-H. Sun, "Fault-Aware Runtime Strategies for High Performance Computing", IEEE Trans. on Parallel and Distributed Systems, 20(4), 460-473, 2009.

Z. Zheng and Z. Lan, "Reliability-Aware Scalability Models for High Performance Computing", Proc. of IEEE Cluster'09, 2009.

Z. Zheng, Z. Lan, B-H. Park, and A. Geist, "System Log Pre-processing to Improve Failure Prediction", Proc. of DSN'09, 2009. [PDF]

Z. Zheng, R. Gupta, Z. Lan, and S. Coghlan, "FTB-enabled Failure Prediction for Blue Gene/P Systems", Proc. of SC'09 (research poster), 2009.

W. Tang, Z. Lan, N. Desai, and D. Buettner, "Fault-Aware Utility-Based Job Scheduling on Blue Gene/P Systems", Proc. of IEEE Cluster'09, 2009.

H. Jin, X. Sun, Z. Zheng, Z. Lan, and B. Xie, "Performance under Failures of DAG-based Parallel Computing", Proc. of CCGrid'09, 2009. [PDF]

B-H. Park, Z. Zheng, Z. Lan, and A. Geist, "Analyzing Failure Events on ORNL's Cray XT4", Proc. of SC'08 (research poster), 2008.

Y. Li and Z. Lan, "A Fast Recovery Mechanism for Checkpointing in Networked Environments", Proc. of DSN'08, 2008. [PDF]

Contact:
Dr. Zhiling Lan (lan AT iit DOT edu)

This work is supported by the US National Science Foundation.