RAPS: Recovery Aware Parallel computing Systems
With rapid advances in processing technology and the emergence of
multi-core processors and specialized co-processors, parallel processing
is permeating almost every aspect of our lives, from high-end computing to
commodity deployment. Production systems in the next few years are expected
to contain hundreds of thousands of processors, with each processor containing dozens
of cores. As their scale and complexity keep growing,
these systems often fail in unpredictable ways. Studies have shown that in production
systems, failure rates range from 20 to more than 1000 failures per year and that,
depending on the root cause of the problem, the system-level mean time to
repair (MTTR) ranges from a couple of hours for failures caused by human
errors to nearly 100 hours for failures due to hardware problems. Years of intense research have focused
on pre-failure prediction and tolerance, that is, predicting failures and taking precautionary actions before they
occur. Despite research progress on failure
prediction, unexpected failures frequently occur in practice, especially in modern parallel systems with
unprecedented sizes and complexities. Because failures are inevitable, pre-failure prediction and tolerance alone are
insufficient for fault management. Post-failure diagnosis and recovery (i.e., the procedures carried out after a
failure) are as important as avoiding and tolerating failures, and they have a profound impact on almost every
aspect of parallel computing.
The goal of this research project is to develop RAPS, a Recovery Aware Parallel computing System for
post-failure diagnosis and recovery. The research focuses on how to quickly and effectively resume
parallel computing after a failure has occurred. The ultimate goal is to seamlessly integrate post-failure
diagnosis and recovery with pre-failure prediction and tolerance as a compound fault management solution
for parallel computing. The approach consists of (1) development of new diagnosis mechanisms for fast
failure detection and root cause analysis, (2) development of system-wide orchestration for recovery
coordination, (3) design of new recovery techniques for quick restoration of parallel applications, and
(4) a comprehensive evaluation. The results of this project can significantly improve the productivity of
parallel systems.
Collaborators:
(ANL) Narayan Desai, Daniel Buettner, Rajeev Thakur, Susan Coghlan, Rinku Gupta and Pete Beckman
(ORNL) Byung-Hoon Park and Al Geist
(SDSC) John White and Eva Hocks
Publications:
Y. Li and Z. Lan,
"FREM: A Fast Restart Mechanism for General Checkpoint/Restart",
to appear in the IEEE Trans. on Computers, 2010.
Z. Zheng, Z. Lan, R. Gupta, S. Coghlan, and P. Beckman,
"A Practical Failure Prediction with Location and Lead Time for Blue Gene/P",
Proc. of the 1st Workshop on Fault-Tolerance for HPC at Extreme Scale
(FTXS), in conjunction with DSN'10, 2010.
Z. Lan, J. Gu, Z. Zheng, R. Thakur, and S. Coghlan,
"A Study of Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems",
Journal of Parallel and Distributed Computing (JPDC), 2010.
W. Tang, N. Desai, D. Buettner, and Z. Lan,
"Analyzing and Adjusting User Runtime Estimates to Improve Job Scheduling on Blue Gene/P",
Proc. of IPDPS'10, 2010. [Best Paper Award]
Z. Lan, Z. Zheng, and Y. Li,
"Toward Automated Anomaly Identification in Large-Scale Systems",
IEEE Trans. on Parallel and Distributed Systems, Feb., 2010.
Z. Zheng and Z. Lan,
"Reliability-Aware Scalability Models for High Performance Computing",
Proc. of IEEE Cluster'09, 2009.
W. Tang, Z. Lan, N. Desai, and D. Buettner,
"Fault-Aware Utility-Based Job Scheduling on Blue Gene/P Systems",
Proc. of IEEE Cluster'09, 2009.
Y. Li, Z. Lan, P. Gujrati, and X. Sun,
"Fault-Aware Runtime Strategies for High Performance Computing",
IEEE Trans. on Parallel and Distributed Systems, vol. 20(4), pp. 460-473, 2009.
Z. Zheng, Z. Lan, B-H. Park, and A. Geist,
"System Log Pre-processing to Improve Failure Prediction",
Proc. of DSN'09, 2009.
Z. Zheng, R. Gupta, Z. Lan, and S. Coghlan,
"FTB-enabled Failure Prediction for Blue Gene/P Systems",
Proc. of SC'09 (research poster), 2009.
H. Jin, X. Sun, Z. Zheng, Z. Lan, and B. Xie,
"Performance under Failures of DAG-based Parallel Computing",
Proc. of CCGrid'09, 2009.
B-H. Park, Z. Zheng, Z. Lan, and A. Geist,
"Analyzing Failure Events on ORNL's Cray XT4",
Proc. of SC'08 (research poster), 2008.
Y. Li and Z. Lan,
"A Fast Recovery Mechanism for Checkpointing in Networked Environments",
Proc. of DSN'08, 2008.
Z. Zheng, Y. Li, and Z. Lan,
"Anomaly Localization in Large-scale Clusters",
Proc. of IEEE Cluster'07, 2007.
Contact:
Dr. Zhiling Lan (lan AT iit DOT edu)
This work is supported by the US National Science Foundation.