Dr. Zhiling Lan

Assistant Professor
Department of Computer Science

Illinois Institute of Technology

Time: Monday, October 6th, 11:00 am

Location: SB233

 

Reliability Research for High Performance Computing

 

Abstract

As the scale of high performance computing (HPC) continues to grow, reliability is becoming a critical concern. Recent studies have pointed out that the MTBF (mean time between failures) of teraflop and petaflop machines is only on the order of 10-100 hours. This situation is likely only to deteriorate in the near future, thereby threatening the promising productivity of HPC systems. In this talk, I will discuss my research projects that aim to address the problem from two aspects: (1) pre-failure prediction and tolerance and (2) post-failure diagnosis and recovery. Specifically, my work on pre-failure prediction and tolerance is centered upon building FENCE, a Fault-aware ENabled Computing Environment. The core of FENCE is to adaptively integrate proactive and reactive methods with the goal of avoiding anticipated failures if possible and, in the case of unforeseeable failures, minimizing their impact. My work on post-failure diagnosis and recovery focuses on designing RAPS (Recovery Aware Parallel computing Systems) to quickly and effectively resume parallel computing in the presence of failures. Our ultimate goal is to seamlessly integrate post-failure diagnosis and recovery with pre-failure prediction and tolerance as a compound fault management solution for high performance computing.

Short Bio

ZHILING LAN is an assistant professor of Computer Science at Illinois Institute of Technology. She received her BS in Mathematics from Beijing Normal University, her MS in Applied Mathematics from the Chinese Academy of Sciences, and her PhD in Computer Engineering from Northwestern University. Her research interests are in the area of parallel and distributed systems, in particular fault-tolerant computing, dynamic load balancing, and performance analysis and modeling.

 
