Tools and Software:
-
SysDP:
a System Diagnosis and Prognosis toolkit. Currently, it has been
tested with RAS (Reliability, Availability, and Serviceability) logs from Blue Gene/L
systems and Cray XT4 systems. [Coming Soon!]
-
FT-Pro:
an application-level adaptive fault tolerance system for parallel
applications. Here, "application-level" means the focus is on reducing application completion time in
the presence of failure. It allows applications to avoid anticipated failures via
preventive migration, and in the case of unforeseeable failures, to minimize their impact through
selective checkpointing. It is implemented with the MPICH-V checkpointing package.
-
FARS:
a Fault-Aware Runtime System for system-level adaptive fault tolerance. Here, "system-level" means the
primary goal is to improve system productivity in the presence of failure. It not only includes runtime
strategies to allocate spare nodes for failure avoidance, but also provides a general mechanism to select
running jobs for rescheduling in case of resource contention.
An event-driven simulator has been developed to emulate
computing systems that use a batch scheduler enhanced with FARS. It has been tested
with both synthetic data and machine traces collected from production systems.
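
To illustrate what an event-driven simulation of a batch-scheduled system with failures looks like, here is a minimal Python sketch. It is not the FARS simulator and does not implement the FARS strategies (spare-node allocation or job rescheduling). It assumes a strict first-come-first-served scheduler, a fixed node pool, and failures that kill a running job, which is then resubmitted from scratch. The function name `simulate`, the event names, and the job tuple layout are hypothetical and chosen only for this example.

```python
import heapq
from itertools import count

ARRIVE, FINISH, FAIL = "arrive", "finish", "fail"

def simulate(total_nodes, jobs, failures=()):
    """Toy event-driven batch simulation (illustrative assumption, not FARS).

    jobs:     list of (arrival_time, nodes_needed, runtime)
    failures: iterable of (time, job_id); a failure kills the job if it is
              running, and the job is resubmitted from scratch.
    Returns a dict mapping job_id to completion time.
    """
    events = []
    seq = count()  # tie-breaker so events at equal times pop in insertion order
    for jid, (arrival, _, _) in enumerate(jobs):
        heapq.heappush(events, (arrival, next(seq), ARRIVE, jid, 0))
    for time, jid in failures:
        heapq.heappush(events, (time, next(seq), FAIL, jid, 0))

    free = total_nodes
    waiting = []                 # FCFS queue of job ids
    running = {}                 # job id -> attempt number now running
    attempts = [0] * len(jobs)
    done = {}

    def start_ready_jobs(now):
        nonlocal free
        # Strict FCFS: stop at the first queued job that does not fit.
        while waiting and jobs[waiting[0]][1] <= free:
            jid = waiting.pop(0)
            free -= jobs[jid][1]
            attempts[jid] += 1
            running[jid] = attempts[jid]
            finish_time = now + jobs[jid][2]
            heapq.heappush(events, (finish_time, next(seq), FINISH, jid, attempts[jid]))

    while events:
        now, _, kind, jid, attempt = heapq.heappop(events)
        if kind == ARRIVE:
            waiting.append(jid)
        elif kind == FINISH:
            if running.get(jid) != attempt:
                continue         # stale event from an attempt killed by a failure
            del running[jid]
            free += jobs[jid][1]
            done[jid] = now
        elif kind == FAIL:
            if jid not in running:
                continue         # failure hit a job that was not running
            del running[jid]
            free += jobs[jid][1]
            waiting.append(jid)  # resubmit at the back of the queue
        start_ready_jobs(now)
    return done
```

A fault-aware policy could be plugged into such a loop by handling a failure-prediction event before the failure itself, for example by moving the affected job to a spare node instead of letting it be killed.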
-
FREM:
a Fast REstart Mechanism to improve process recovery for general checkpoint/restart protocols. The core
idea is to enable early process restart on a partial checkpoint image by tracking data access patterns after
each checkpoint. A prototype system that implements FREM with the BLCR checkpointing tool has been developed
and tested with SPEC 2006. Email to request the software.
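
The core idea can be illustrated with a small Python model. This is a sketch under stated assumptions, not the FREM prototype and not BLCR's API: it assumes the checkpoint image is a mapping from page id to contents, that the pages touched shortly after a checkpoint are the ones needed first after a restart, and that pages missing from the partial image can be fetched on demand. All names (`record_working_set`, `split_image`, `RestartedProcess`) are hypothetical.

```python
def record_working_set(access_trace, window):
    """Distinct pages touched in the first `window` accesses after a
    checkpoint, in first-touch order."""
    return list(dict.fromkeys(access_trace[:window]))

def split_image(image, working_set):
    """Split a full checkpoint image into the partial image needed for an
    early restart and the remainder that can be restored later."""
    early = {p: image[p] for p in working_set if p in image}
    rest = {p: data for p, data in image.items() if p not in early}
    return early, rest

class RestartedProcess:
    """Model of a process that resumes as soon as the partial image is loaded."""

    def __init__(self, early, rest):
        self.memory = dict(early)    # pages restored before resuming
        self._pending = dict(rest)   # pages still to be restored
        self.faults = 0              # on-demand fetches after resuming

    def read(self, page):
        if page not in self.memory:
            # Page was not in the partial image: fetch it on demand.
            self.memory[page] = self._pending.pop(page)
            self.faults += 1
        return self.memory[page]

    def restore_remaining(self):
        """Load every page that has not been restored yet."""
        self.memory.update(self._pending)
        self._pending.clear()
```

In this model, the fewer on-demand fetches a restarted process incurs, the better the tracked access pattern predicted what the process needs right after restart.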
-
ParaDLB and DistDLB:
Dynamic load balancing methods for large-scale applications using the structured adaptive mesh refinement
(SAMR) algorithm.
The methods have been implemented and tested in the cosmological simulation code ENZO.