FENCE: Fault awareness ENabled Computing Environment

As the scale of high performance computing continues to grow, fault management is becoming a critical challenge. Recent studies have pointed out that the MTBF (mean time between failures) of teraflop and petaflop machines is only on the order of 10-100 hours. This situation is likely to deteriorate in the near future, threatening the promised productivity of large-scale systems. Checkpointing is the conventional method for fault tolerance. However, it deals with failures only after they occur, through rollback. When a single process fails, all processes, including non-faulty ones, have to be restarted from the most recent state saved before the failure. Significant performance loss can therefore be incurred from both the lost work and the failure recovery. Proactive approaches take preventive actions (e.g., preemptive process migration) before failures occur, thereby avoiding failures at low cost. Nevertheless, their effectiveness relies on perfect fault prediction, which is hardly achievable in practice.
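To make the trade-off above concrete, here is a minimal toy model, not part of FENCE itself. It assumes that a correctly predicted failure is avoided by migration at a fixed cost, that a missed failure forces a rollback losing on average half a checkpoint interval plus a restart cost, and that false alarms are ignored. The function name, parameters, and sample values are all hypothetical and chosen only for illustration.

```python
def expected_loss_per_failure(checkpoint_interval_h, restart_cost_h,
                              migration_cost_h, recall):
    """Expected hours lost per failure under a simple toy model.

    Assumptions (illustrative only, not taken from the FENCE project):
      * recall is the fraction of failures predicted in time to migrate.
      * A predicted failure costs only migration_cost_h.
      * A missed failure loses, on average, half a checkpoint interval
        of work (failures assumed uniform within the interval) plus
        restart_cost_h for recovery.
      * False alarms (unnecessary migrations) are ignored.
    """
    if not 0.0 <= recall <= 1.0:
        raise ValueError("recall must be between 0 and 1")
    rollback_loss = checkpoint_interval_h / 2.0 + restart_cost_h
    return recall * migration_cost_h + (1.0 - recall) * rollback_loss


if __name__ == "__main__":
    # Hypothetical numbers: 4-hour checkpoint interval, 0.5-hour restart,
    # 0.1-hour migration.
    for recall in (0.0, 0.5, 0.9, 1.0):
        loss = expected_loss_per_failure(4.0, 0.5, 0.1, recall)
        print(f"recall={recall:.1f}: expected loss {loss:.2f} h per failure")
```

With recall set to 0 the model reduces to checkpointing alone, and with recall set to 1 it reduces to purely proactive migration. Intermediate values show why a combination of the two is attractive when prediction is imperfect.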

This project aims to build FENCE, a Fault awareness ENabled Computing Environment for high performance computing. FENCE is "hybrid" in that it integrates offline analysis and runtime support to enhance fault management. Offline analysis models the likelihood of faults based on historical data and thereby facilitates intelligent system configuration and mapping, while runtime support diagnoses runtime events and moves running jobs away from troublesome resources. FENCE is also "adaptive" in that it combines the merits of the emerging proactive fault tolerance approach with those of the traditional checkpointing approach. Proactive actions enable applications to avoid anticipated faults where possible, whereas reactive actions aim to minimize the impact of unforeseeable failures. The following figure illustrates the major components of FENCE:

Faculty Members:
  • Zhiling Lan
  • Xian-He Sun

Graduate Students:
  • Yawei Li
  • Ziming Zheng
  • Wei Tang
  • Jin Hui
  • Jiexing Gu
  • Prashasta Gujrati
  • Bing Xie

Collaborators:
  • Byung-Hoon Park and Al Geist (ORNL)
  • Rajeev Thakur (ANL)
  • John White and Eva Hocks (SDSC)
  • Cobalt Team (ANL)

Recent Talks:
  • Z. Zheng, "Failure Prediction with Cray Log", The Cray Log Analysis Contest at the 1st USENIX Workshop on the Analysis of System Logs (WASL'08, co-located with the 8th USENIX OSDI'08), December 2008. [PDF]
  • Z. Lan, "FENCE: Fault awareness ENabled Computing Environment", Argonne National Laboratory, May 27, 2008. [PDF]
  • X.-H. Sun, "Building a Fault-Aware Computing Environment", Oak Ridge National Laboratory, Feb. 25, 2008. [PDF]
  • Z. Lan, "Adaptive Fault Management for High Performance Computing", Lawrence Livermore National Laboratory, Dec. 13, 2007. [PDF]
  • Z. Lan, "Building a Fault-aware Computing Environment for High End Computing", APART'07 Workshop (in conjunction with SC07), Nov. 11, 2007. [PDF]

Recent News:
  • A feature article on our adaptive fault tolerance work in "International Science Grid This Week"

Publications:
  • Please see my research page for our publications.

Contact:
    Dr. Zhiling Lan (lan AT iit DOT edu)

    This work is supported by the US National Science Foundation.