SPEaR: Toward Smart HPC through Active Learning and Intelligent Scheduling

As high performance computing (HPC) continues to grow in scale, energy and resilience have become first-class concerns alongside performance. These concerns demand significant changes across the system stack, including resource management and job scheduling. To harness the potential of extreme-scale systems, this project aims to incorporate intelligence into resource management and job scheduling. More specifically, it will develop a framework named SPEaR (Scheduling for Performance, Energy, and Resilience efficiency) that dynamically optimizes scheduling along three dimensions: performance, energy, and resilience. The research has two thrusts: active learning, to automatically extract valuable performance, energy, and resilience patterns and tradeoffs from application and system data; and intelligent scheduling, to improve and control performance, resilience, and energy efficiency in resource management and scheduling. An event-driven scheduling simulator is being developed to comprehensively evaluate scheduling policies and their aggregate effects. The simulator, along with system logs, will be made available to the broad community under an open source license.

This project creates critical technologies to promote system productivity and makes important advances toward smart HPC. Additionally, the learning techniques developed in this project are useful for other big data problems of national interest. The education plan enhances the undergraduate and graduate curricula and broadens the participation of underrepresented groups.

  • Zhiling Lan (faculty)
  • Li Yu (Ph.D. student)
  • Sean Wallace (Ph.D. student)
  • Xu Yang (Ph.D. student)
  • Zhou Zhou (Ph.D. student)
  • Eduardo Berrocal (Ph.D. student)
  • Xin Wang (Ph.D. student)

  • Collaborators:
  • Mike Papka at Argonne
  • Paul Rich at Argonne
  • Wei Tang at Argonne

  • Key Publications:
  • Z. Zhou, X. Yang, Z. Lan, P. Rich, W. Tang, V. Morozov, and N. Desai, "Improving Batch Scheduling on Blue Gene/Q by Relaxing 5D Torus Network Allocation Constraints", Proc. of IPDPS'15, 2015.
  • E. Berrocal, L. Yu, S. Wallace, M. Papka, and Z. Lan, "Exploring Void Search for Fault Detection on Extreme Scale Systems" (Best Paper Award), Proc. of IEEE Cluster'14, 2014. [PDF]
  • X. Yang, X. Zheng, Z. Zhou, W. Tang, J. Wang, and Z. Lan, "Balancing Job Performance with System Performance via Locality-Aware Scheduling on Torus-Connected Systems", Proc. of IEEE Cluster'14, 2014. [PDF]
  • Z. Zheng, L. Yu, and Z. Lan, "Reliability-Aware Speedup Models for Parallel Applications with Coordinated Checkpointing/Restart", to appear in IEEE Trans. on Computers, 2014.
  • X. Yang, Z. Zhou, S. Wallace, Z. Lan, W. Tang, S. Coghlan, and M. Papka, "Integrating Dynamic Pricing of Electricity into Energy Aware Scheduling for HPC Systems", Proc. of SC'13, 2013. [PDF]

  • Software Tools:
  • CQSim - a discrete-event-driven scheduling simulator. [Link]

  • Contact:
    Dr. Zhiling Lan (lan AT iit DOT edu)

    This project is supported by the US National Science Foundation. Note: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.