SPEaR: Toward Smart HPC through Active Learning and Intelligent Scheduling

As high performance computing (HPC) continues to grow in scale, energy and resilience become first-class concerns alongside performance. These concerns demand significant changes in many aspects of the system stack, including resource management and job scheduling. To harness the potential of extreme-scale systems, this project aims to incorporate intelligence into resource management and job scheduling. More specifically, it will develop a framework named SPEaR (Scheduling for Performance, Energy, and Resilience efficiency) for dynamically optimizing scheduling across three dimensions: performance, energy, and resilience. The research focuses on two thrusts: active learning, to automatically extract valuable performance, energy, and resilience patterns and tradeoffs from application and system data; and intelligent scheduling, to improve and control performance, resilience, and energy efficiency in resource management and scheduling. An event-driven scheduling simulator is being developed to comprehensively evaluate scheduling policies and their aggregate effects. The simulator, along with system logs, will be made available to the broad community under an open source license.

This project creates critical technologies to promote system productivity and makes important advances toward smart HPC. Additionally, the learning techniques developed in this project are useful for other big data problems of national interest. The education plan enhances the undergraduate and graduate curricula and broadens participation of underrepresented groups.

A 1-page poster summary is available: poster-2016.

Participants:
  • Zhiling Lan (PI)
  • Boyang Li (Ph.D. student)
  • Xin Wang (Ph.D. student)
  • Yuping Fan (Ph.D. student)
  • Peixin Qiao (Ph.D. student)
  • Yao Kang (Ph.D. student)
  • Li Yu (graduated in 2016, and joined Google Inc.)
  • Eduardo Berrocal (graduated in 2017, and joined Intel)
  • Sean Wallace (graduated in 2017, and joined Cray)
  • Xu Yang (graduated in 2017, and joined Amazon)

Collaborators:
  • Mike Papka (Argonne & NIU)
  • Susan Coghlan (Argonne)
  • William Allcock (Argonne)
  • Franck Cappello (Argonne)
  • Rob Ross (Argonne)
  • Venkat Vishwanath (Argonne)
  • John Jenkins (Argonne)
  • Paul Rich (Argonne)
  • Misbah Mubarak (Argonne)
  • Sheng Di (Argonne)
  • Leonardo Bautista-Gomez (Argonne)
  • Vitali Morozov (Argonne)
  • Wei Tang (Google)
  • Narayan Desai (Google)

Key Publications [Link]
  • L. Yu, Z. Zhou, Y. Fan, M. Papka, and Z. Lan, "System-wide Tradeoff Modeling of Performance, Power, and Resilience on Petascale Systems", Journal of Supercomputing, 2018.
  • E. Berrocal, L. Bautista-Gomez, S. Di, Z. Lan, and F. Cappello, "Toward General Software Level Silent Data Corruption Detection for Parallel Applications", IEEE Trans. on Parallel and Distributed Systems (TPDS), 2017.
  • P. Qiao, X. Wang, X. Yang, Y. Fan, and Z. Lan, "Joint Effects of Application Communication Pattern, Job Placement, and Network Routing on Fat-Tree Systems", Proc. of ICPP-W, 2018.
  • Y. Fan, P. Rich, W. Allcock, M. Papka, and Z. Lan, "Trade-off Between Prediction Accuracy and Underestimation Rate in Job Runtime Estimates", Proc. of IEEE Cluster'17 (acceptance rate is 21.8%), 2017.
  • W. Allcock, P. Rich, Y. Fan, and Z. Lan, "Experience and Practice of Batch Scheduling on Leadership Supercomputers at Argonne", Proc. of the 21st workshop on Job Scheduling Strategies for Parallel Processing (JSSPP), 2017.
  • X. Yang, J. Jenkins, M. Mubarak, R. Ross, and Z. Lan, "Watch Out for the Bully! Job Interference Study on Dragonfly Network", Proc. of SC16 (acceptance rate is 18%), 2016.
  • S. Wallace, X. Yang, V. Vishwanath, W. Allcock, S. Coghlan, M. Papka, and Z. Lan, "A Data Driven Scheduling Approach for Power Management on HPC Systems", Proc. of SC16 (acceptance rate is 18%), 2016.
  • Z. Zhou, X. Yang, Z. Lan, P. Rich, W. Tang, V. Morozov, and N. Desai, "Improving Batch Scheduling on Blue Gene/Q by Relaxing 5D Torus Network Allocation Constraints", IEEE Transactions on Parallel and Distributed Systems, 2016.
  • Z. Zhou, X. Yang, D. Zhao, P. Rich, W. Tang, J. Wang, and Z. Lan, "I/O Aware Job Scheduling and Bandwidth Allocation for Petascale Computing Systems", Journal of Parallel Computing (ParCo), 2016.
  • S. Wallace, Z. Zhou, V. Vishwanath, S. Coghlan, J. Tramm, Z. Lan, and M.E. Papka, "Application Power Profiling on IBM Blue Gene/Q", Journal of Parallel Computing (ParCo), 2016.
  • E. Berrocal, L. Bautista-Gomez, S. Di, Z. Lan, and F. Cappello, "Exploring Partial Replication to Improve Lightweight Silent Data Corruption Detection for HPC Applications", Proc. of Euro-Par, 2016.
  • L. Yu, Z. Zhou, S. Wallace, M.E. Papka, and Z. Lan, "Quantitative Modeling of Power Performance Tradeoffs on Extreme Scale Systems", Journal of Parallel and Distributed Computing, 2015.
  • L. Yu and Z. Lan, "A Scalable, Non-Parametric Anomaly Detection Method for Large Scale Computing", IEEE Transactions on Parallel and Distributed Systems, vol. 99(7), pp. 1902-1914, 2015.
  • E. Berrocal, L. Bautista-Gomez, S. Di, Z. Lan, and F. Cappello, "Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications" (short paper), Proc. of HPDC'15, 2015.
  • E. Berrocal, L. Yu, S. Wallace, M. Papka, and Z. Lan, "Exploring Void Search for Fault Detection on Extreme Scale Systems" (Best Paper Award), Proc. of IEEE Cluster'14, 2014.
  • Z. Zheng, L. Yu, and Z. Lan, "Reliability-Aware Speedup Models for Parallel Applications with Coordinated Checkpointing/Restart", IEEE Trans. on Computers, 2014.

Software Tools and Data:
  • (Software) CQSim - a trace-based, event-driven scheduling simulator. [Link]
  • (Software) PuPPET - a Petri net based modeling tool for quantitative power and performance analysis [Link]
  • (Software) TOPPER - a Petri net based modeling tool for quantitative analysis of performance, power, and resilience [Link]
  • (Data) Workload traces at ALCF [Link].
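To illustrate what "trace-based, event-driven" means for a scheduling simulator, here is a minimal sketch in Python. It is not CQSim's actual code or interface: the names `Job` and `simulate_fcfs`, the four-field job record, and the choice of strict first-come-first-served (FCFS) without backfilling are all assumptions made for illustration. The simulator advances time by jumping from one event (job submission or job completion) to the next instead of ticking a clock.

```python
import heapq
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    """One trace record (hypothetical fields, not a real trace format)."""
    job_id: int
    submit: float   # submission time
    runtime: float  # actual runtime
    nodes: int      # nodes requested


def simulate_fcfs(trace, total_nodes):
    """Replay a trace under strict FCFS; return {job_id: (start, end)}."""
    for job in trace:
        if job.nodes > total_nodes:
            raise ValueError(f"job {job.job_id} can never fit on the machine")

    # Event tuples are (time, kind, seq, job). kind 0 = finish, 1 = submit,
    # so nodes released at time t are visible to jobs arriving at time t.
    # seq is unique, which keeps tuple comparison from ever reaching `job`.
    events = [(job.submit, 1, seq, job) for seq, job in enumerate(trace)]
    heapq.heapify(events)
    seq = len(trace)

    free = total_nodes
    waiting = deque()  # jobs in arrival order
    schedule = {}

    while events:
        now, kind, _, job = heapq.heappop(events)
        if kind == 0:
            free += job.nodes
        else:
            waiting.append(job)

        # Strict FCFS: only the head of the queue may start (no backfilling).
        while waiting and waiting[0].nodes <= free:
            head = waiting.popleft()
            free -= head.nodes
            end = now + head.runtime
            schedule[head.job_id] = (now, end)
            heapq.heappush(events, (end, 0, seq, head))
            seq += 1

    return schedule


if __name__ == "__main__":
    demo = [Job(1, 0.0, 10.0, 4), Job(2, 1.0, 5.0, 2), Job(3, 2.0, 3.0, 2)]
    result = simulate_fcfs(demo, total_nodes=4)
    waits = [result[j.job_id][0] - j.submit for j in demo]
    print("mean wait:", sum(waits) / len(waits))
```

Swapping in a different policy amounts to changing the inner loop that decides which waiting jobs may start, which is why an event-driven design is convenient for comparing scheduling policies on the same trace.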

    Contact:
    Dr. Zhiling Lan (lan AT iit DOT edu)

    Acknowledgement:
    This project is supported by the US National Science Foundation (CCF-1422009). Note: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.