IRON: Reducing Workload Interference on Massively Parallel Platforms

Interconnect networks with Dragonfly and fat-tree configurations are dominant in high-performance computing facilities and data centers. A key challenge in managing these shared networks is workload interference. In a multi-user computing environment, contention among applications for shared network resources can trigger a vicious cycle in which workload interference, low productivity, selfish user behavior, and poor scheduling aggravate one another. This project aims to address this fundamental problem on massively parallel systems by developing the IRON (Interference ReductiON) framework. The project consists of three research thrusts: (1) network simulation to gain insights into communication interference among applications and to explore various what-if questions about workload interference, (2) interference-aware scheduling to develop intelligent scheduling strategies for avoiding or mitigating network contention among applications, and (3) real-world experiments to quantitatively characterize workload interference and to assess the interference-aware scheduling design.

Completion of the project will create novel interference-aware scheduling policies and scalable software tools for interference analysis and reduction on massively parallel systems with shared network configurations. The data and tools resulting from the simulations and experiments will be made available to the broad community under an open source license. An integrated education and outreach plan will enhance the Computer Science curriculum, broaden participation by underrepresented groups, and reach out to the surrounding communities, which are predominantly African-American and Latino.

  • Zhiling Lan (PI)

  • Graduate Students:
  • Xin Wang (PhD student)
  • Boyang Li (PhD student)

  • Collaborators:
  • Xu Yang (former PhD student, now at Amazon)
  • Misbah Mubarak (Argonne)
  • Rob Ross (Argonne)
  • Sudheer Chunduri (Argonne)
  • Kevin Harms (Argonne)

  • Key Publications:
  • X. Wang, M. Mubarak, Y. Kang, R. Ross, and Z. Lan, "COWG: a Toolkit for Accelerating Analysis of Mixed ", Technical Report, Dept. of Computer Science, Illinois Tech, 2019.
  • Y. Kang, X. Wang, N. McGlohon, M. Mubarak, S. Chunduri, and Z. Lan, "Modeling and Analysis of Application Interference on Dragonfly+", Proc. of ACM SIGSIM PADS'19, 2019. [PDF]
  • B. Li, S. Chunduri, K. Harms, Y. Fan, and Z. Lan, "The Effect of System Utilization on Application Performance Variability", Proc. of ROSS'19 (Runtime and Operating Systems for Supercomputers), 2019. [PDF]
  • X. Wang, M. Mubarak, X. Yang, R. Ross, and Z. Lan, "Trade-off Study of Localizing Communication and Balancing Network Traffic on a Dragonfly System", Proc. of IPDPS'18, 2018. [PDF] The paper was ranked among the top 8% of the 461 conference submissions.

  • Software Tools and Data:
  • (Software) COWG (coNCePTuaL-CODES Online Workload Generator): available on GitLab and GitHub soon!
  • (Software) A Dragonfly+ module for CODES networking simulations. It is released to the public as the open-source “dfp-fpar” branch in the CODES GitHub repository. [Link]
  • (Software) CQSim: a discrete-event-driven scheduling simulator. [Link]
  • (Data) Application communication traces collected on Theta, the 11.69-petaflops Cray XC40 machine at ALCF. [Link]

  • Contact:
    Dr. Zhiling Lan (lan AT iit DOT edu)

    This project is supported by the US National Science Foundation (CNS-1717763). Note: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.