IRON: Reducing Workload Interference on Massively Parallel Platforms
Interconnect networks with Dragonfly and fat-tree configurations are dominant in high-performance computing
facilities and data centers. A key challenge of managing these shared networks is workload interference. In a
multi-user computing environment, contention among applications for shared network resources can cause a vicious cycle of events (workload interference, low productivity, selfish
user behavior, and poor scheduling) that aggravate one another. This project aims to address this fundamental
problem on massively parallel systems by developing the IRON (Interference ReductiON) framework. The
project consists of three research thrusts: (1) network simulation to gain insights into communication interference among
applications and to explore what-if questions about workload interference, (2)
interference-aware scheduling to develop intelligent scheduling strategies for avoiding or mitigating network
contention among applications, and (3) real-world experiments to quantitatively characterize workload
interference and to assess the interference-aware scheduling design.
Completion of the project will create novel interference-aware scheduling
policies and scalable software tools for interference analysis and reduction on massively parallel systems with
shared network configurations. The data and tools resulting from simulations and experiments will be made available to the broad community
under an open source license. An integrated education and outreach plan will enhance the Computer Science
curriculum, broaden participation by underrepresented groups, and reach out to the surrounding communities, which
are predominantly African-American and Latino.
Zhiling Lan (PI)
Xin Wang (PhD student)
Boyang Li (PhD student)
Xu Yang (former PhD student, now at Amazon)
Misbah Mubarak (Argonne)
Rob Ross (Argonne)
Sudheer Chunduri (Argonne)
Kevin Harms (Argonne)
X. Wang, M. Mubarak, Y. Kang, R. Ross, and Z. Lan,
"COWG: a Toolkit for Accelerating Analysis of Mixed ", Technical Report, Dept. of Computer Science, Illinois Tech, 2019.
Y. Kang, X. Wang, N. McGlohon, M. Mubarak, S. Chunduri, and Z. Lan,
"Modeling and Analysis of Application Interference on Dragonfly+", Proc. of ACM SIGSIM PADS'19, 2019. [PDF]
B. Li, S. Chunduri, K. Harms, Y. Fan, and Z. Lan,
"The Effect of System Utilization on Application Performance Variability", Proc. of ROSS'19 (Runtime and Operating Systems for Supercomputers), 2019. [PDF]
X. Wang, M. Mubarak, X. Yang, R. Ross, and Z. Lan,
"Trade-off Study of Localizing Communication and Balancing Network Traffic on a Dragonfly System", Proc. of IPDPS'18, 2018. [PDF]
This paper was ranked among the top 8% of the 461 conference submissions.
Software Tools and Data:
(Software) COWG (coNCePTuaL-CODES Online Workload Generator): available on GitLab and GitHub soon!
(Software) A Dragonfly+ module for CODES network simulations. It is publicly released as the open-source
“dfp-fpar” branch of CODES on GitHub.
(Software) CQSim - a discrete-event-driven scheduling simulator.
(Data) Application communication traces collected on Theta, the 11.69-petaflop Cray XC40 machine at ALCF.
Dr. Zhiling Lan (lan AT iit DOT edu)
This project is supported by the US National Science Foundation.
Note: Any opinions, findings, and conclusions or recommendations expressed in this
material are those of the author(s) and do not necessarily reflect the views of the
National Science Foundation.