Collaborative Research: Experimental-based Research on Effective Models of Parallel Application Execution Time, Power, and Resilience

The increasing scale and complexity of parallel systems present enormous challenges to parallel applications. One such challenge is the integration and balancing of execution time, power, and resilience for parallel applications. The MuMMI_R project seeks to advance the scientific understanding of the interdependence among power, execution time, and resilience for various application-system configurations. The broader impacts include training undergraduate and graduate students and taking part in programs such as REU, CREU, and DREU to increase the involvement of students from underrepresented groups in the project.

The MuMMI_R research aims to develop effective techniques for quantifying the complicated tradeoffs among execution time, power, and resilience, and to provide a tuning mechanism for user-defined metrics. Toward this goal, the research focuses on three interrelated thrusts: (1) experimental research to conduct extensive experiments on a suite of representative applications under different resilience strategies on various parallel architectures, (2) application-level co-modeling to develop analytical models and colored Petri net-based simulation for quantifying the correlations and tradeoffs among execution time, power, and resilience, and (3) model-based analysis to examine the tradeoffs among resilience, execution time, and power for different application-system configurations, and to tune application implementations for a user-defined target metric on current and future systems. The resulting framework, MuMMI_R, will provide valuable insights into application-system interactions and aid in the design of efficient parallel applications (with respect to execution time, power requirements, and resilience), runtime systems, and computer architectures.

This is a collaborative project between two universities: the University of Chicago and the Illinois Institute of Technology.

Team Members:
  • Valerie Taylor (UChicago, PI)
  • Xingfu Wu (UChicago, co-PI)
  • Zhiling Lan (Illinois Tech, co-PI)
  • Peixin Qiao (Ph.D. student)
  • Manqi Zhang (Ph.D. student, 2016-2017)

  • Key Publications:
  • Xingfu Wu, Valerie Taylor, Jeanine Cook, and Philip Mucci, "Using Performance-Power Modeling to Improve Energy Efficiency of HPC Applications", IEEE Computer, Vol. 49, No. 10, pp. 20-29, Oct. 2016.
  • Xingfu Wu, Valerie Taylor, and Zhiling Lan, "MuMMI_R: Analyzing and Modeling Power and Time under Different Resilience Strategies", SC2016 Poster, 2016.
  • Xingfu Wu, Valerie Taylor, and Zhiling Lan, "Evaluating Runtime and Power Requirements of Multilevel Checkpointing MPI Applications on Four Parallel Architectures: An Empirical Case Study", Cray User Group Conference, 2018.
  • Xingfu Wu, Valerie Taylor, Justin M. Wozniak, Rick Stevens, Thomas Brettin, and Fangfang Xia, "Performance, Power, and Scalability Analysis of the Horovod Implementation of the CANDLE NT3 Benchmark on the Cray XC40 Theta", SC18 Workshop on Python for High-Performance and Scientific Computing, 2018.
  • Xingfu Wu, Valerie Taylor, Justin M. Wozniak, Rick Stevens, Thomas Brettin, and Fangfang Xia, "Performance, Energy, and Scalability Analysis and Improvement of Parallel Cancer Deep Learning CANDLE Benchmarks", Proc. of ICPP'19, 2019.
  • P. Qiao, Z. Lan, X. Wu, and V. Taylor, "Application Power Pattern Characterization: Implications of Power Capping", Technical Presentation at Argonne MCS, 2019.

  • Experimental Data:
  • The MuMMI_R database [Link]

  • Contact:
  • Valerie Taylor (vtaylor AT anl DOT gov)
  • Zhiling Lan (lan AT iit DOT edu)

  • Acknowledgement:
    This project is supported by the US National Science Foundation (CCF-1618776). Note: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.