Invited speakers


Kirk W. Cameron

Virginia Polytechnic Institute and State University


Title: The HPC Energy Crisis

Abstract: An energy crisis is looming for the HPC community. Petaflop systems may require 100 megawatts of power, roughly one-third the output of a small power plant (300 megawatts) and enough to supply 1.6 million 60-watt light bulbs, the lighting for a small city. Such systems will cost millions of dollars to operate annually and produce tremendous amounts of heat, leading to reduced reliability. Thus, power is now a first-class constraint in clusters designed for scientific computing. Today’s HPC challenge is to reduce the power consumption of emergent systems without sacrificing performance. To address this challenge, we have created an infrastructure that enables profiling, analysis, control, and optimization of the power consumed by high-end applications on power-scalable clusters. We use this framework to show that the key to saving energy in HPC clusters lies in identifying and exploiting performance inefficiencies in parallel scientific applications. We have used power-scalable prototypes of high-end systems to achieve system-wide energy savings as large as 30% with less than 1% performance loss. In this talk, we describe our techniques and results, lessons learned, and future directions in high-performance, power-aware distributed computing.
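
The general idea behind such power-scalable systems can be made concrete with a small sketch: lower the CPU frequency during a communication phase, where the processor mostly waits, and restore it before computation resumes. This is only an illustration of the principle, not the speaker's actual framework; it assumes a Linux cpufreq "userspace" governor (frequencies written in kHz to a sysfs scaling_setspeed file) and uses standard MPI calls.

    /* Illustration only: assumes the cpufreq "userspace" governor is active. */
    #include <mpi.h>
    #include <stdio.h>

    static void set_cpu_khz(long khz)
    {
        FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed", "w");
        if (f) { fprintf(f, "%ld\n", khz); fclose(f); }
    }

    void wait_at_low_power(MPI_Request *req, long low_khz, long high_khz)
    {
        set_cpu_khz(low_khz);             /* drop frequency: the CPU mostly idles here */
        MPI_Wait(req, MPI_STATUS_IGNORE); /* communication phase needs little CPU work */
        set_cpu_khz(high_khz);            /* restore full speed before computing again */
    }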

Bio: Kirk W. Cameron is an associate professor in the Department of Computer Science and director of the Scalable Performance (SCAPE) Laboratory at Virginia Polytechnic Institute and State University. His research interests include high-performance and grid computing, parallel and distributed systems, computer architecture, power-aware systems, and performance evaluation and prediction. Cameron received a PhD in computer science from Louisiana State University. He is a recipient of NSF and DOE Career awards, and is a member of the IEEE and the IEEE Computer Society.


Karl Fürlinger

Technical University of Munich, Germany


Title: Scalable Automated Performance Analysis using Performance Properties

Abstract: We present our approach for automating the performance analysis of parallel applications based on the idea of ASL performance properties. Our tool Periscope automatically searches for inefficiencies specified as ASL properties, leveraging a set of agents distributed over the target machine and arranged in a tree-like hierarchy. Decomposing the analysis across this set of agents allows it to be performed in a scalable way: as the machine or the target application grows in the number of nodes or processors used, Periscope scales correspondingly in the number of agents employed.
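
As an illustration of the hierarchical decomposition (a toy sketch only, not Periscope's implementation), the fragment below shows how interior agents can forward only the worst property severity found in their subtree, so the root agent never has to inspect per-process data directly:

    #include <stdio.h>

    #define FANOUT 4   /* each agent supervises at most four children */

    /* Stand-in for the severity of a property detected by leaf agent i. */
    static double leaf_severity(int i) { return (i * 37 % 100) / 100.0; }

    /* An interior agent's job: query its children (modeled here by recursion
     * over subranges of leaves) and forward only the worst severity seen. */
    static double aggregate(int first_leaf, int count)
    {
        if (count == 1)
            return leaf_severity(first_leaf);
        double worst = 0.0;
        int chunk = (count + FANOUT - 1) / FANOUT;
        for (int off = 0; off < count; off += chunk) {
            int n = (off + chunk <= count) ? chunk : count - off;
            double s = aggregate(first_leaf + off, n);
            if (s > worst)
                worst = s;
        }
        return worst;
    }

    int main(void)
    {
        printf("worst severity over 1024 processes: %.2f\n", aggregate(0, 1024));
        return 0;
    }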

Bio: Karl Fürlinger is a research associate at the Chair for Computer Architecture at the Institute for Computer Science at the Technical University of Munich, where he works on performance analysis of parallel applications. He received diplomas in mathematics and computer science from the University of Salzburg, Austria and his PhD from TU Munich, Germany.

Michael Gerndt

Technical University of Munich, Germany


Title: Scalable Automated Performance Analysis using Performance Properties

Abstract: We present our approach for automating the performance analysis of parallel applications based on the idea of ASL performance properties. Our tool Periscope automatically searches for inefficiencies specified as ASL properties, leveraging a set of agents distributed over the target machine and arranged in a tree-like hierarchy. Decomposing the analysis across this set of agents allows it to be performed in a scalable way: as the machine or the target application grows in the number of nodes or processors used, Periscope scales correspondingly in the number of agents employed.

Bio: Michael Gerndt is an Associate Professor for Architecture of Parallel and Distributed Systems at Technische Universität München, where he works on automatic performance analysis of parallel programs. His research interests include language design, compilation techniques, and programming tools for parallel and distributed systems.

Don Holmgren

Fermi National Accelerator Laboratory

Title: Optimization of Lattice QCD Calculations

Abstract: Over the past five years, the Department of Energy has supported the optimization of lattice QCD calculations through its SciDAC (Scientific Discovery through Advanced Computing) program. This talk will discuss the results of this program, including the development and performance of libraries for communications, I/O, SU(3) algebra, and data-parallel computation. The talk will also discuss the development and performance of hardware systems dedicated to these calculations.
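
The innermost kernel that such SU(3) algebra libraries optimize is the multiplication of a 3x3 complex link matrix by a 3-component color vector. The sketch below is a plain reference version for illustration only; the tuned libraries provide much faster variants of the same operation.

    #include <complex.h>

    typedef double complex su3_matrix[3][3];   /* one gauge link   */
    typedef double complex su3_vector[3];      /* one color vector */

    /* Reference version of the kernel: out = u * in. */
    void su3_mat_vec(const su3_matrix u, const su3_vector in, su3_vector out)
    {
        for (int i = 0; i < 3; i++) {
            out[i] = 0.0;
            for (int j = 0; j < 3; j++)
                out[i] += u[i][j] * in[j];     /* complex multiply-add */
        }
    }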

Bio: Don Holmgren is a member of the Software Coordinating Committee of the SciDAC Lattice QCD Computing Project. A staff member at the Fermi National Accelerator Laboratory since 1995, Dr. Holmgren works in the areas of high-performance data acquisition for experiments and parallel computing for lattice QCD calculations. He has led the lattice QCD project in the Computing Division at Fermilab since 1999. Dr. Holmgren holds a PhD in Experimental Condensed Matter Physics from the University of Illinois at Urbana-Champaign.

Peter Kacsuk

Computer and Automation Research Institute, Hungarian Academy of Sciences


Title: Grid interoperability by P-GRADE Grid portal

Abstract: Grid interoperability can be solved at the workflow level inside a Grid portal. Indeed, the P-GRADE Grid portal solves the interoperability problem at the workflow level with great success. It can simultaneously distribute and execute different components of a workflow in several Grids, even if they are based on different Grid technologies. In this way the user can exploit more parallelism than inside a single Grid. Moreover, the workflow level completely hides the low-level Grid details from the end user, who does not have to learn the low-level commands of the different Grids and can port workflow applications between Grids with minimal effort. It also eliminates application porting work when a production Grid moves to a new Grid middleware (e.g., when EGEE moved from LCG-2 to gLite). Since second-generation Grid middleware such as GT2 supports job submission, while third-generation Grid middleware such as GT4 provides service-oriented usage of the Grid, the P-GRADE portal supports both job submission and service invocation inside workflows. For service invocation, the P-GRADE portal was integrated with the GEMLCA legacy code architecture developed by the University of Westminster, which turns legacy codes into Grid services without touching either the source or the binary code of the legacy applications. The NGS P-GRADE portal, currently operating as a service for the UK NGS, and the GIN VO portal of OGF demonstrate that the GT2-based NGS, OSG, and TeraGrid can be made interoperable with GT4 and EGEE sites through these portals.

Bio: Peter Kacsuk is the Head of the Laboratory of Parallel and Distributed Systems in the Computer and Automation Research Institute of the Hungarian Academy of Sciences. He received his MSc and university doctorate degrees from the Technical University of Budapest in 1976 and 1984, respectively. He received the kandidat degree from the Hungarian Academy in 1989 and habilitated at the University of Vienna in 1997. He received his professor title from the Hungarian President in 1999 and the Doctor of Academy degree (DSc) from the Hungarian Academy of Sciences in 2001. He has been a part-time full professor at the Cavendish School of Computer Science of the University of Westminster and at the Eötvös Lóránd University of Science since 2001. He has served as a visiting scientist or professor at various universities in Australia, Austria, Canada, England, Germany, Spain, Japan, and the USA. He has published two books, two lecture notes, and more than 200 scientific papers on parallel computer architectures, parallel software engineering, and Grid computing. He is co-editor-in-chief of the Journal of Grid Computing, published by Springer.


Bernd Mohr

Research Center Jülich, Germany


Title: Scalable Performance Analysis of Large Scale Applications

Abstract: Automatic trace analysis is an effective method for identifying complex performance phenomena in parallel applications. However, as the size of parallel systems and the number of processors used by individual applications continue to grow, the traditional approach of analyzing a single global trace file, as done by KOJAK's EXPERT trace analyzer, becomes increasingly constrained by the large number of events. In this talk, we present a scalable version of the EXPERT analysis based on analyzing separate local trace files with a parallel tool that replays the target application's communication behavior. We describe the new parallel analyzer architecture and discuss first empirical results obtained on a 16,384-processor system.
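
The replay idea can be sketched as follows (an illustration of the approach, not KOJAK's actual code): during replay, the matching sender re-sends only the timestamp at which it entered its original send, and the receiver compares it with its own recorded enter time to quantify late-sender waiting, with no process ever touching a global trace file.

    #include <mpi.h>

    /* One recorded receive event from the process-local trace (simplified). */
    typedef struct { int src, tag; double enter_time; } recv_event;

    /* Re-enact one receive: the partner replays the matching send by sending
     * its recorded send-enter timestamp instead of the original payload. */
    double replay_recv(const recv_event *ev)
    {
        double send_enter;
        MPI_Recv(&send_enter, 1, MPI_DOUBLE, ev->src, ev->tag,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        double wait = send_enter - ev->enter_time;  /* receiver arrived earlier */
        return wait > 0.0 ? wait : 0.0;             /* late-sender waiting time */
    }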

Bio: Dr. Bernd Mohr started designing and developing tools for the performance analysis of parallel programs as early as his diploma thesis at the University of Erlangen in Germany, and continued this work during his Ph.D. During a three-year postdoc position at the University of Oregon, he was responsible for the design and implementation of the original TAU performance analysis framework for the parallel programming language pC++. Since 1996 he has been a senior scientist at the Research Center Jülich. Besides being responsible for user support and training with regard to performance tools, he leads the KOJAK research group on automatic performance analysis of parallel programs, a joint project with the Innovative Computing Laboratory at the University of Tennessee. He was a founding member and work package leader of the European Community IST working group on automatic performance analysis, APART. He is the author of several dozen conference and journal articles about performance analysis and tuning of parallel programs.

Scott Pakin

Los Alamos National Laboratory


Title: Beyond Ping-Pong: Application-Centric Measurements of Network Performance

Abstract: As applications are scaled to ever-larger numbers of processors, application performance depends more heavily on the performance of the underlying interconnection network. Microbenchmarks are commonly used to evaluate the performance of various networks. However, microbenchmarks tend to be divorced from the manner in which applications actually utilize the network. In this talk, we present a software framework for rapidly and easily developing network benchmarks that are more representative of application behavior.
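
For reference, the conventional ping-pong microbenchmark alluded to in the title looks roughly like the sketch below: two ranks bounce a small message back and forth and report the average one-way latency. It is a useful baseline, but, as the talk argues, not necessarily representative of how real applications use the network.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const int reps = 1000, nbytes = 8;
        char buf[8] = {0};
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {            /* rank 0 sends, then waits for the echo */
                MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {     /* rank 1 echoes the message back */
                MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("one-way latency: %.2f us\n", (t1 - t0) / (2.0 * reps) * 1e6);
        MPI_Finalize();
        return 0;
    }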

Bio: Scott Pakin is a Technical Staff Member in the Performance and Architecture Lab, part of the Modeling, Algorithms and Informatics group (CCS-3) at Los Alamos National Laboratory. His current research interests include analyzing and improving the performance of high-performance computing systems, with particular emphasis on the communication subsystem. He has published papers on such topics as high-speed messaging layers, language design and implementation, job-scheduling algorithms, and resource-management systems. He received a B.S. in Mathematics/Computer Science with Research Honors from Carnegie Mellon University in May 1992, an M.S. in Computer Science from the University of Illinois at Urbana-Champaign in January 1995, and a Ph.D. from the University of Illinois at Urbana-Champaign in October 2001.


Xian-He Sun

Scalable Computing Software (SCS) laboratory, Illinois Institute of Technology


Title: Remove the Memory Wall: From performance analysis to architecture optimization

Abstract: It is generally agreed that performance evaluation is an important component of high performance computing. However, many see performance evaluation as benchmarking, or as no more than locating performance bottlenecks. In this talk we argue that performance evaluation consists of many components, including measurement, analysis, modeling, optimization, and tools. With an in-depth understanding of performance tradeoffs, performance evaluation can often lead to unconventional approaches that improve performance significantly. Data access is a known bottleneck of high performance computing (HPC). Although advanced memory hierarchies and parallel file systems have been developed in recent years, they only provide high bandwidth for contiguous, well-formed data streams and perform poorly for accessing small, noncontiguous data. The problematic memory wall remains after years of study. Through performance evaluation, we propose the “Server-Push” data access architecture for HPC. Unlike traditional designs where data is stored and retrieved on request (pull-based), in the “Server-Push” model a data access server proactively pushes data from a file server to the compute node's memory, or from its memory to its cache, depending on the architecture design. Performance modeling is used to decide what data to fetch, whereas performance prediction is the means to determine when to fetch the data. Experimental results show that with the new approach the cache hit rates increase to well above 90% for various benchmark applications that are notorious for poor cache performance. Our current success illustrates the power and unique role of performance evaluation in computing.
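
A rough analogue of the push idea (not the authors' Server-Push implementation) can be sketched with a simple stride predictor that asks the operating system to stage the next expected file block into memory before the compute loop requests it; the block size, the helper name, and the use of posix_fadvise are choices made for this illustration only.

    #define _POSIX_C_SOURCE 200112L
    #include <fcntl.h>

    #define BLOCK (1L << 20)   /* 1 MiB blocks (arbitrary for this sketch) */

    /* Called after each application read: guess the next block from the
     * observed stride and ask the OS to stage it before it is requested. */
    void push_next_block(int fd, long last_block, long prev_block)
    {
        long stride = last_block - prev_block;   /* observed access stride */
        long predicted = last_block + stride;    /* next block we expect   */
        if (predicted >= 0)
            posix_fadvise(fd, predicted * BLOCK, BLOCK, POSIX_FADV_WILLNEED);
    }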

Bio: Dr. Xian-He Sun is a professor of Computer Science at the Illinois Institute of Technology (IIT) and the director of the Scalable Computing Software (SCS) laboratory at IIT. He is a guest faculty member at Argonne National Laboratory and a visiting scientist at the Fermi National Accelerator Laboratory. Before joining IIT, he was a post-doctoral researcher at the Ames Laboratory, a staff scientist at ICASE, NASA Langley Research Center, an ASEE fellow at the Naval Research Laboratory, and a professor at Louisiana State University. His research interests include high performance computing, performance evaluation, and distributed systems. Dr. Sun is a founding member of APART. He has been a regular participant in the IEEE SC conference since 1990 and has served on the SC technical committee since 2003.


Valerie Taylor

Department of Computer Science, Texas A&M University


Title: Prophesy: Performance Modeling and Analysis of Parallel Applications

Abstract: Performance models provide significant insight into the performance relationships between an application and the system used for execution. In particular, models can be used to predict the relative performance of different systems used to execute an application or to explore the performance impact of using different algorithms to solve a given task. Prophesy is a web-based environment that attempts to automate the process of performance modeling and analysis of parallel applications. In this talk, I will discuss the Prophesy environment and present some examples of how Prophesy has been used for resource selection within grid environments, performance analysis of applications executing on clusters of SMPs, and performance tuning for efficient utilization of parallel systems.
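
The kind of model-based resource selection mentioned above can be illustrated with a toy analytical model; the model form and all coefficients below are invented for this example and are not taken from Prophesy. Each candidate system is reduced to a per-element compute cost and a per-message cost, and the system with the smaller predicted runtime is chosen.

    #include <math.h>
    #include <stdio.h>

    /* A candidate system, reduced to two coefficients for this toy model. */
    typedef struct { const char *name; double t_comp, t_msg; } system_model;

    /* Predicted runtime for n elements on p processors: a compute term plus a
     * logarithmic communication term (e.g., global reductions). */
    static double predict(const system_model *s, double n, int p)
    {
        return s->t_comp * n / p + s->t_msg * log2((double)p);
    }

    int main(void)
    {
        /* Hypothetical coefficients, not fitted from real measurements. */
        system_model a = { "cluster-A", 2.0e-8, 3.0e-5 };
        system_model b = { "cluster-B", 1.2e-8, 9.0e-5 };
        double n = 1e9;
        int p = 256;

        double ta = predict(&a, n, p), tb = predict(&b, n, p);
        printf("predicted: %s %.3f s, %s %.3f s -> choose %s\n",
               a.name, ta, b.name, tb, ta < tb ? a.name : b.name);
        return 0;
    }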

Bio: Valerie E. Taylor earned her B.S. in Electrical and Computer Engineering and M.S. in Computer Engineering from Purdue University in 1985 and 1986, respectively, and a Ph.D. in Electrical Engineering and Computer Science in 1991. From 1991 to 2002, Dr. Taylor was a member of the faculty in the Electrical and Computer Engineering Department at Northwestern University. Dr. Taylor joined the faculty of Texas A&M University as Head of the Dwight Look College of Engineering's Department of Computer Science in January 2003 and is the holder of the Royce E. Wisenbaker Professorship II. Her research interests are in the area of high performance computing, with particular emphasis on performance analysis and modeling of parallel and distributed applications and mesh partitioning for distributed systems. She has authored or co-authored over 80 papers in these areas. Dr. Taylor has received numerous awards for distinguished research and leadership. She is a member of the ACM and a Senior Member of the IEEE Computer Society.


Jeffrey Vetter

Computer Science and Mathematics Division, Oak Ridge National Laboratory


Title: Application Accelerators: deus ex machina?

Abstract: Commodity computing systems are rapidly moving to homogeneous multicore processors as a strategy to continue improving performance while confronting the constraints of power, heat, signaling, and instruction-level parallelism. Enter application accelerators. Numerous hardware accelerators have recently appeared on the supercomputing scene: FPGAs, ClearSpeed, the STI Cell, graphics processing units, etc. Our initial investigations have revealed that these accelerators can dramatically improve the performance of specific algorithms; in one example, our acceleration of an ORNL protein-folding application with FPGAs has shown good speedups. Nevertheless, accelerators face numerous hurdles to widespread adoption, such as programmer productivity and unstable performance reactivity. We are tackling these challenges in the Siskiyou project at ORNL with new performance modeling tools and a software system targeted at tightly coupled heterogeneous computing systems.

Bio: Jeffrey Vetter is a computer scientist in the Computer Science and Mathematics Division (CSM) of Oak Ridge National Laboratory (ORNL), where he leads the Future Technologies Group, and he is also a joint professor in the College of Computing at Georgia Tech. His research interests lie largely in the areas of experimental software systems and architectures for high-end computing. Jeff earned his Ph.D. in Computer Science from the Georgia Institute of Technology; he joined CSM in 2003.


Felix Wolf

Forschungszentrum Jülich, RWTH Aachen University, Germany


Title: Scalable Performance Analysis of Large Scale Applications

Abstract: Automatic trace analysis is an effective method for identifying complex performance phenomena in parallel applications. However, as the size of parallel systems and the number of processors used by individual applications continue to grow, the traditional approach of analyzing a single global trace file, as done by KOJAK's EXPERT trace analyzer, becomes increasingly constrained by the large number of events. In this talk, we present a scalable version of the EXPERT analysis based on analyzing separate local trace files with a parallel tool that replays the target application's communication behavior. We describe the new parallel analyzer architecture and discuss first empirical results obtained on a 16,384-processor system.

Bio: Felix Wolf is an assistant professor at RWTH Aachen University and leader of the research group Performance Analysis of Parallel Programs at Forschungszentrum Jülich, Germany. Between 2003 and 2005, he worked at the Innovative Computing Laboratory at the University of Tennessee. He received his Ph.D. from RWTH Aachen University in 2003. His primary research interest is performance analysis on large-scale systems.