The MuMMI_R Project

Collaborative Research: Experimental-based Research on Effective Models of Parallel Application Execution Time, Power, and Resilience

The increasing scale and complexity of parallel systems present enormous challenges to parallel applications. One such challenge is the integration and balancing of execution time, power, and resilience for parallel applications. The MuMMI_R project seeks to advance the scientific understanding of the interdependence among power, execution time, and resilience for various application-system configurations. The broader impacts include training undergraduate and graduate students and engaging with programs such as REU, CREU, and DREU to increase the participation of students from underrepresented groups in the project.

This project aims to develop effective techniques for quantifying the complicated tradeoffs among execution time, power, and resilience, and to provide a tuning mechanism for user-defined metrics. The project consists of three research thrusts: (1) experimental study of different application-system configurations, (2) development of models that quantify the interplay among runtime, power, and resilience, and (3) model-based analysis. The resulting framework, MuMMI_R, can provide valuable insights into application-system interactions and aid in the design of efficient parallel applications (with respect to execution time, power requirements, and resilience), runtime systems, and computer architectures. The key outcomes include technical papers, a user-level dynamic power capping library, and a large amount of experimental data for the community.

This is a collaborative project between two universities: the University of Chicago and Illinois Institute of Technology.

Team Members

Valerie Taylor (UChicago, PI)
Xingfu Wu (UChicago, co-PI)
Zhiling Lan (Illinois Tech, co-PI)
Sahil Sharma (BS, 1/2020-5/2021)
Avery Peck (BS, 1/2020-12/2020)
Boyang Li (PhD, 9/2019-12/2020)
Xin Wang (PhD, 6/2020-8/2020)
Melanie Cornelius (PhD, 8/2020-3/2021)
Peixin Qiao (PhD, 2017-12/2019)
Manqi Zhang (PhD, 2016-2017)

Key Publications

X. Wu, V. Taylor, J. Cook, and P. Mucci, "Using Performance-Power Modeling to Improve Energy Efficiency of HPC Applications", IEEE Computer, Vol. 49, No. 10, pp. 20-29, Oct. 2016.
X. Wu, V. Taylor, and Z. Lan, "MuMMI_R: Analyzing and Modeling Power and Time under Different Resilience Strategies", SC2016 Poster, 2016.
X. Wu, V. Taylor, and Z. Lan, "Evaluating Runtime and Power Requirements of Multilevel Checkpointing MPI Applications on Four Parallel Architectures: An Empirical Case Study", Cray User Group Conference, 2018.
X. Wu, V. Taylor, J. Wozniak, R. Stevens, T. Brettin, and F. Xia, "Performance, Power, and Scalability Analysis of the Horovod Implementation of the CANDLE NT3 Benchmark on the Cray XC40 Theta", SC18 Workshop on Python for High-Performance and Scientific Computing, 2018.
X. Wu, V. Taylor, J. Wozniak, R. Stevens, T. Brettin, and F. Xia, "Performance, Energy, and Scalability Analysis and Improvement of Parallel Cancer Deep Learning CANDLE Benchmarks", Proc. of ICPP'19, 2019.
P. Qiao, Z. Lan, X. Wu, and V. Taylor, "Application Power Pattern Characterization: Implications of Power Capping", Technical Presentation at Argonne MCS, 2019.
X. Wu, V. Taylor, and Z. Lan, "Performance and Power Modeling and Prediction Using MuMMI and Ten Machine Learning Methods", Cray User Group Conference, 2020.
Xingfu Wu, Aniruddha Marathe, Siddhartha Jana, Ondrej Vysocky, Jophin John, Andrea Bartolini, Lubomir Riha, Michael Gerndt, Valerie Taylor, and Sridutt Bhalachandra, "Toward an End-to-End Auto-tuning Framework in HPC PowerStack", Energy Efficient HPC State of Practice 2020 (EE HPC SOP 20), Sep. 14-17, 2020, Kobe, Japan.
Xingfu Wu and Valerie Taylor, "Utilizing Ensemble Learning for Performance and Power Modeling and Improvement of Parallel Cancer Deep Learning CANDLE Benchmarks", Concurrency and Computation: Practice and Experience, 2021, e6516, https://doi.org/10.1002/cpe.6516.
S. Sharma, Z. Lan, X. Wu, and V. Taylor, "A Dynamic Power Capping Library for HPC Applications", IEEE Cluster (2-page poster), 2021.
S. Sharma, Z. Lan, X. Wu, and V. Taylor, "DNPC: a Dynamic Node-level Power Capping Library for Scientific Applications", Undergraduate Research Journal at Illinois Tech, Spring 2021.
M. Cornelius, A. Peck, Z. Lan, W. Allcock, and B. Toonen, "A Study of NPB and CANDLE on Commercial Off-the-Shelf Disaggregated Memory", The 2nd Workshop on Resource Disaggregation and Serverless (WORDS '21), co-located with ASPLOS'21.

Software & Data

The MuMMI_R database [Link]

The open-source dynamic power capping library called DNPC [Link]

Contact

Valerie Taylor (vtaylor AT anl DOT gov)

Zhiling Lan (lan AT iit DOT edu)

Acknowledgement:
This project is supported by the US National Science Foundation (CCF-1618776 and CCF-1801856). Note: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.