Selected Publications:
- Y. Yu, D. Rudd, Z. Lan, N. Gnedin, A. Kravtsov, and J. Wu,
"Improving Parallel IO Performance of Cell-based AMR Cosmology Applications",
to appear in the Proc. of IPDPS'12, 2012.
- W. Tang, N. Desai, V. Vishwanath, D. Buettner, and Z. Lan,
"Multi-Domain Job Coscheduling for Leadership Computing Systems",
Journal of Supercomputing, 2011. [link]
- J. Wu, R. Gonzalez, Z. Lan, N. Gnedin, A. Kravtsov, D. Rudd, and Y. Yu,
"Performance Emulation of Cell-based AMR Cosmology Simulations",
Proc. of Cluster'11, 2011. [PDF]
- L. Yu, Z. Zheng, Z. Lan, and S. Coghlan,
"Practical Online Failure Prediction for Blue Gene/P: Period-based vs Event-Driven",
Proc. of Proactive Failure Avoidance, Recovery, and Maintenance Workshop (PFARM), in conjunction with DSN'11, 2011. [PDF]
- W. Tang, Z. Lan, N. Desai, D. Buettner, and Y. Yu,
"Walltime-Aware Spatial Job Scheduling on Blue Gene/P Systems",
Proc. of IPDPS'11, 2011. [PDF]
- Z. Zheng, L. Yu, W. Tang, Z. Lan, R. Gupta, N. Desai, S. Coghlan, and D. Buettner,
"Co-Analysis of RAS Log and Job Log on Blue Gene/P",
Proc. of IPDPS'11, 2011. [PDF]
- Y. Li and Z. Lan,
"FREM: A Fast Restart Mechanism for General Checkpoint/Restart",
to appear in the IEEE Trans. on Computers, 2010.
- Z. Zheng, Z. Lan, R. Gupta, S. Coghlan, and P. Beckman,
"A Practical Failure Prediction with Location and Lead Time for Blue Gene/P",
Proc. of the 1st Workshop on Fault-Tolerance for HPC at Extreme Scale
(FTXS), in conjunction with DSN'10, 2010. [PDF]
- Z. Lan, J. Gu, Z. Zheng, R. Thakur, and S. Coghlan,
"A Study of Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems",
Journal of Parallel and Distributed Computing (JPDC), 2010. [PDF]
- W. Tang, N. Desai, D. Buettner, and Z. Lan,
"Analyzing and Adjusting User Runtime Estimates to Improve Job Scheduling on Blue Gene/P",
Proc. of IPDPS'10, 2010. [Best Paper Award] [PDF]
- Z. Lan, Z. Zheng, and Y. Li,
"Toward Automated Anomaly Identification in Large-Scale Systems",
IEEE Trans. on Parallel and Distributed Systems, vol. 21(2), pp. 174-187, 2010. [PDF]
- Z. Zheng and Z. Lan,
"Reliability-Aware Scalability Models for High Performance Computing",
Proc. of IEEE Cluster'09, 2009. [PDF]
- W. Tang, Z. Lan, N. Desai, and D. Buettner,
"Fault-Aware Utility-Based Job Scheduling on Blue Gene/P Systems",
Proc. of IEEE Cluster'09, 2009. [PDF]
- Y. Li, Z. Lan, P. Gujrati, and X. Sun,
"Fault-Aware Runtime Strategies for High Performance Computing",
IEEE Trans. on Parallel and Distributed Systems, vol. 20(4), pp. 460-473, 2009. [PDF]
- Z. Zheng, Z. Lan, B-H. Park, and A. Geist,
"System Log Pre-processing to Improve Failure Prediction",
Proc. of DSN'09, 2009. [PDF]
- H. Jin, X. Sun, Z. Zheng, Z. Lan, and B. Xie,
"Performance under Failures of DAG-based Parallel Computing",
Proc. of CCGrid'09, 2009. [PDF]
- B-H. Park, Z. Zheng, Z. Lan, and A. Geist,
"Analyzing Failure Events on ORNL's Cray XT4",
Proc. of SC'08 (research poster), 2008.
- J. Gu, Z. Zheng, Z. Lan, J. White, E. Hocks, and B-H. Park,
"Dynamic Meta-Learning for Failure Prediction in Large-scale Systems: A Case Study",
Proc. of ICPP'08, 2008. [PDF]
- Y. Li and Z. Lan,
"A Fast Recovery Mechanism for Checkpointing in Networked Environments",
Proc. of DSN'08, 2008. [PDF]
- Z. Lan and Y. Li,
"Adaptive Fault Management of Parallel Applications for High Performance Computing",
IEEE Trans. on Computers, vol. 57(12), pp. 1647-1660, 2008. [PDF]
- Z. Lan, Y. Li, Z. Zheng, and P. Gujrati,
"Enhancing Application Robustness through Adaptive Fault Tolerance",
Proc. of the NSFNGS Workshop (in conjunction with IPDPS'08), 2008. [PDF]
- X. Sun, Z. Lan, Y. Li, H. Jin, and Z. Zheng,
"Towards a Fault-Aware Computing Environment",
Proc. of High Availability and Performance Computing Workshop, 2008.
- Z. Zheng, Y. Li, and Z. Lan,
"Anomaly Localization in Large-scale Clusters",
Proc. of IEEE Cluster'07, 2007. [PDF]
- P. Gujrati, Y. Li, Z. Lan, R. Thakur, and J. White,
"Exploring Meta-learning to Improve Failure Prediction in Supercomputing Clusters",
Proc. of ICPP'07, 2007. [PDF]
- Y. Li, P. Gujrati, Z. Lan, and X. Sun,
"Fault-Driven Re-Scheduling for Improving System-Level Fault Resilience",
Proc. of ICPP'07, 2007. [PDF]
- Z. Lan, Y. Li, P. Gujrati, Z. Zheng, R. Thakur, and J. White,
"A Fault Diagnosis and Prognosis Service for TeraGrid Clusters",
Proc. of TeraGrid'07, 2007. [PDF]
- Y. Li and Z. Lan,
"Using Adaptive Fault Tolerance to Improve Application Robustness on the TeraGrid",
Proc. of TeraGrid'07, 2007. [PDF]
- Y. Li and Z. Lan,
"Exploit Failure Prediction for Adaptive Fault-Tolerance in Cluster Computing",
Proc. of IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid'06), 2006. [PDF]
- Z. Lan and Y. Li,
"Failure-Aware Resource Selection for Grid Computing",
Proc. of IEEE Conference on Dependable Systems and Networks (Fast Abstract), 2006. [PDF]
- Y. Li and Z. Lan,
"Improving Fault Resilience of High Performance Applications",
Research Poster at SC'06, 2006. [PDF]