IRON: Reducing Workload Interference on Massively Parallel Platforms

Interconnect networks with Dragonfly and Fat-tree configurations are dominant in high-performance computing facilities and data centers. A key challenge in managing these shared networks is workload interference. In a multi-user computing environment, applications compete for shared network resources, which can set off a vicious cycle in which workload interference, low productivity, selfish user behavior, and poor scheduling aggravate one another. This project aims to address this fundamental problem on massively parallel systems by developing the IRON (Interference ReductiON) framework. The project consists of three research thrusts: (1) flit-level network modeling and simulation to gain insights into workload interference and to explore what-if questions about it, (2) interference-aware scheduling to mitigate network congestion and contention among applications, and (3) experiments to quantitatively characterize workload interference and to assess the interference-aware scheduling design.

Completion of the project will make three key contributions to the community: (1) advanced knowledge of workload interference on large-scale systems, (2) novel interference-aware routing and scheduling policies to mitigate workload interference, and (3) open-source software tools for modeling and mitigating interference on massively parallel systems with shared network configurations. An integrated education and outreach plan will enhance the Computer Science curriculum, broaden participation by underrepresented groups, and support outreach programs for K-16.

  • Zhiling Lan (PI)

  • Graduate Students:
  • Xin Wang (PhD student)
  • Yao Kang (PhD student)
  • Boyang Li (PhD student)
  • Dustin Favorite (MS student, Spring 2021 - )
  • Naunidh Singh (MS student, Spring 2021)

  • Collaborators:
  • Xu Yang (former PhD student, now at Amazon)
  • Misbah Mubarak (Argonne, Amazon)
  • Rob Ross (Argonne)
  • Sudheer Chunduri (Argonne)
  • Kevin Harms (Argonne)

  • Key Publications:
  • Y. Kang, X. Wang, and Z. Lan, "Q-adaptive: A Multi-Agent Reinforcement Learning Based Routing on Dragonfly Network", ACM HPDC, 2021. [PDF]
  • Y. Fan, Z. Lan, T. Childers, P. Rich, W. Allcock, and M. Papka, "Deep Reinforcement Agent for Scheduling in HPC", IPDPS, 2021. [PDF]
  • Y. Fan and Z. Lan, "DRAS-CQSim: A Reinforcement Learning based Framework for HPC Cluster Scheduling", Software Impacts, 2021. [PDF]
  • B. Li, Y. Fan, and Z. Lan, "Direct Future Prediction Agent for Multi-Resource Scheduling in HPC", Technical report, Illinois Tech, May 2021.
  • X. Wang, M. Mubarak, Y. Kang, R. Ross, and Z. Lan, "Union: An Automatic Workload Manager for Accelerating Network Simulation", Proc. of IPDPS, 2020. [PDF]
  • Y. Kang, "Study of I/O and Communication Traffic Interference on Dragonfly System", Talk at Argonne National Lab, August, 2019.
  • Y. Kang, X. Wang, N. McGlohon, M. Mubarak, S. Chunduri, and Z. Lan, "Modeling and Analysis of Application Interference on Dragonfly+", Proc. of ACM SIGSIM PADS'19, 2019. [PDF]
  • B. Li, S. Chunduri, K. Harms, Y. Fan, and Z. Lan, "The Effect of System Utilization on Application Performance Variability", Proc. of ROSS'19 (Runtime and Operating Systems for Supercomputers), 2019. [PDF]
  • X. Wang, M. Mubarak, X. Yang, R. Ross, and Z. Lan, "Trade-off Study of Localizing Communication and Balancing Network Traffic on a Dragonfly System", Proc. of IPDPS'18, 2018. [PDF]

  • Software Tools and Data:
  • (Software) Union/CODES. It is an automatic workload manager for concurrent execution of synthetic workloads, SWM skeletons, and Conceptual skeletons in CODES. It is available on the team's GitHub. [Link]
  • (Software) CODES Dragonfly+ Module. It is released to the public as the open-source dfp-fpar branch in the CODES GitHub. [Link]
  • (Software) Q-adaptive/SST. It is a reinforcement-learning-driven adaptive routing design for Dragonfly systems. It is available on the team's GitHub. [Link]
  • (Software) DRAS/CQSim. It is a discrete-event-driven scheduling simulator empowered by reinforcement learning. It is available on the team's GitHub. [Link]
  • (Data) Application communication traces collected on the 11.69-petaflop Cray XC40 machine Theta at ALCF. [Link]

  • Contact:
    Dr. Zhiling Lan (lan AT iit DOT edu)

    This project is supported by the US National Science Foundation (CNS-1717763). Note: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.