Keynote Speaker


Dr. Daniel A. Reed

Chancellor's Eminent Professor
University of North Carolina at Chapel Hill
Director, Renaissance Computing Institute


Title: Optimizing the Computing Cloud

Abstract: HPC systems were once islands accessible only via low-bandwidth networks; most data was computationally derived, and single-discipline problems dominated. That world is now a distant memory. We are building petascale systems from commodity processors, game chips, and open source software, and the service clouds of Google, Amazon and Microsoft already operate at trans-petaop levels. As we move aggressively to a ubiquitous computing cloud that delivers services (computational, data, query ...), new performance analysis and reliability approaches will be required. This talk muses on the future of such approaches.

Bio: Daniel A. Reed is the Chancellor's Eminent Professor at the University of North Carolina at Chapel Hill, and the Director of the Renaissance Computing Institute (RENCI). He is a member of the President's Council of Advisors on Science and Technology (PCAST), charged with providing advice on science and technology issues and challenges to the President. He is chair of the board of directors of the Computing Research Association, which represents the major academic departments and industrial research laboratories in North America. He was previously Director of the National Center for Supercomputing Applications (NCSA) and one of the principal investigators and chief architect for the NSF TeraGrid.


Speakers


Dr. Francois Bodin

Universite de Rennes 1
IRISA


Title: Addressing Heterogeneity in Many-core Parallel Applications

Abstract: Hybrid multi-core systems, such as those based on general-purpose processors and GPUs, can provide tremendous computing power. However, exploiting these hybrid systems from existing applications is a difficult task that requires non-portable rewriting of the source. In this talk we present HMPP, a Heterogeneous Multicore Parallel Programming platform that allows the integration of heterogeneous hardware accelerators in a seamless, non-intrusive manner while preserving the legacy code.

Bio: Francois Bodin was born in Rennes on November 14, 1961. He received a Ph.D. degree in computer science from the University of Rennes I (France) in 1989. In 1990, he spent a year as a postdoctoral researcher at the University of Indiana in Prof. Gannon's team. He is currently a Professor at IRISA (Institut de Recherche en Informatique et Systèmes Aléatoires). His research interests include code optimizations and compiler technologies for high performance computers and embedded systems. He has participated in numerous Esprit projects (BRA Apparc, LTR Oceans, R&D Fits, Medea+Mesa,...). In 2002 he founded the startup company CAPS enterprise.


Dr. Frederica Darema

CNS/CISE
National Science Foundation


Title: Performance Engineering Large Scale Computing Systems

Abstract: The talk will address research and technology advances for optimized and dependable execution in large scale computing environments, and will emphasize the imperative of performance engineering such systems. Applications in nearly all sectors (scientific, engineering, and commercial) are capturing more of the behaviors of the systems they represent, and are thereby becoming more powerful but also more complex. At the same time, driven by application requirements and enabled by hardware technology advances, computational platforms are also becoming increasingly powerful and complex. Furthermore, new paradigms, such as dynamic data driven applications systems, which call for symbiotic dynamic integration of the computational and measurement aspects of an application, extend the complexity of today's and future computational environments. Such environments require highly adaptive execution, resource discovery and management, optimized performance, dependability, and fault tolerance. It is crucial to advance beyond the traditional roles of performance modeling and analysis in order to enable efficient and effective design and development of applications, hardware platforms, and measurement and instrumentation platforms (including sensor networks); optimized use of computational and data resources; and guaranteed quality of service and dependability at all layers of the computational system. This requires performance-engineered hardware and software capabilities at all system layers. An overarching consideration, and the thesis of this talk, is that these advances need to be made in a synergistic and integrated manner, taking a systems view in developing these enabling technologies, rather than advancing each of the individual technologies in isolation.

Bio: Frederica Darema, Ph.D., Fellow IEEE, Senior Executive Service Member. Dr. Darema is the Senior Science and Technology Advisor in CNS and CISE, and Director of the Computer Systems Research (CSR) Program. Dr. Darema's interests and technical contributions span the development of parallel applications, parallel algorithms, programming models, environments, and performance methods and tools for the design of applications and of software for parallel and distributed systems. Dr. Darema received her BS degree from the School of Physics and Mathematics of the University of Athens, Greece, and MS and Ph.D. degrees in Theoretical Nuclear Physics from the Illinois Institute of Technology and the University of California at Davis, respectively, where she attended as a Fulbright Scholar and a Distinguished Scholar. After Physics Research Associate positions at the University of Pittsburgh and Brookhaven National Lab, she received an APS Industrial Fellowship and became a Technical Staff Member in the Nuclear Sciences Department at Schlumberger-Doll Research. Subsequently, in 1982, she joined the IBM T. J. Watson Research Center as a Research Staff Member in the Computer Sciences Department, and she later established and managed a research group at IBM Research on parallel applications. While at IBM she also served in the IBM Corporate Strategy Group, examining and helping to set corporate-wide strategies. Dr. Darema was elected IEEE Fellow for proposing in 1984 the SPMD (Single-Program-Multiple-Data) computational model, which has become the popular model for programming today's parallel and distributed computers. Dr. Darema has been at NSF since 1994, where she has developed initiatives for new systems software technologies (the Next Generation Software Program) and research at the interface of neurobiology and computing (the Biological Information Technology and Systems Program).
She has led the DDDAS (Dynamic Data Driven Applications Systems) efforts, including the synonymous cross-Directorate and cross-agency competition, and has also been involved in other cross-Directorate efforts such as the Information Technology Research, the Nanotechnology Science and Engineering, the Scalable Enterprise Systems, and the Sensors Programs. During 1996-1998 she completed a two-year assignment at DARPA, where she initiated a new thrust for research on methods and technology for performance engineered systems.


Dr. Ewa Deelman

Information Sciences Institute
University of Southern California


Title: Optimizing for Time and Space in Distributed Scientific Workflows

Abstract: Scientific workflows have become an enabler of complex scientific analyses. They provide a representation of analyses composed of heterogeneous models designed by groups of scientists. At the same time, workflows have also become a useful representation for managing the execution of large-scale computations. This representation builds a foundation upon which results can be validated and shared, and it facilitates the overall creation and management of the computation. Due to the complexity of the workflows and of the distributed environment often used for their execution, it is important to provide users with abstractions above those provided by today's cyberinfrastructure. As a result, there are opportunities for workflow management systems to efficiently map and execute the high-level workflow descriptions onto target resources. In this talk we will describe the Pegasus Workflow Management System and its time and space optimizations for gravitational-wave physics applications.

Bio: Ewa Deelman is an assistant research professor at the USC Computer Science Department and a Project Leader at the USC Information Sciences Institute. Dr. Deelman's research interests include the design and exploration of collaborative scientific environments based on Grid technologies, with particular emphasis on workflow management as well as the management of large amounts of data and metadata. At ISI, Dr. Deelman is leading the Pegasus project, which designs and implements workflow mapping techniques for large-scale workflows running in distributed environments. Prior to joining ISI in 2000, she was a postdoctoral fellow at UCLA conducting research in the area of performance prediction of large-scale applications on high performance machines. Dr. Deelman received her Ph.D. in Computer Science from Rensselaer Polytechnic Institute in 1997 in the area of parallel discrete event simulation.


Doug Joseph
Research Staff Scientist
IBM T.J. Watson Research Center


Title: Exact String Matching on the Cell/B.E. Platform: the Multiple Embodiments of the Aho-Corasick Algorithm

Abstract: The demand for fast string searching is exploding. String searching is the core of productivity, security and network applications like search engines, intrusion detection systems, virus scanners and spam filters. The growing size of on-line content and the increasing wire speeds push the need for fast, or indeed real-time, string searching solutions. On the computer architecture side, multi-cores are gaining popularity. They offer unprecedented computing power to the software design community, but also challenges. In fact, compilers and tools are rather immature, and it is unclear which parallel programming techniques unleash the most performance. This paper reviews a class of high-performance exact string searching solutions that we have optimized for the IBM Cell/B.E. platform on the basis of the classical Aho-Corasick algorithm. This class provides a range of trade-offs between performance and dictionary size. When dictionaries are small enough to fit in the cores' local memories, the throughput reaches 40 Gbps per processor. With larger dictionaries (as many as hundreds of thousands of patterns), a throughput between 1.6 and 2.2 Gbps is typical.

Bio: Doug Joseph is a research staff scientist in the Deep Computing group at the IBM T.J. Watson Research Center. His research interests include parallel algorithms / applications and high performance system architecture. His current research focus is hybrid computing algorithms and platform architecture.


Dr. Zhiling Lan
Department of Computer Science
Illinois Institute of Technology


Title: Building a Fault-aware Computing Environment for High End Computing

Abstract: As the scale of high performance computing continues to grow, fault management is becoming a critical challenge. Recent studies have pointed out that the MTBF of teraflop and petaflop machines is only on the order of 10-100 hours. This situation is likely to deteriorate in the near future, thereby threatening the promised productivity of HPC systems. In this talk, I will describe an ongoing research project at Illinois Institute of Technology that aims at building FENCE, a Fault-aware ENabled Computing Environment for HPC. FENCE is "hybrid" in that it integrates long-term and short-term support to enhance fault management in HPC. Long-term prediction models the possibility of faults based on historical data and consequently facilitates fault-aware scheduling by intelligently mapping jobs to available resources; short-term prediction diagnoses the root causes of unusual runtime events and triggers job rescheduling on the fly to move running jobs away from these troublesome resources. FENCE is also "adaptive" in that it combines the merits of the newly emerged proactive fault tolerant approach and the traditional checkpointing approach. We will describe the design and implementation of FENCE, followed by preliminary results.

Bio: Zhiling Lan is an assistant professor of computer science at Illinois Institute of Technology. She received her Ph.D. degree in Computer Engineering from Northwestern University in 2002. Her research interests are in the area of parallel and distributed systems, in particular dynamic load balancing, fault tolerant computing, and performance analysis and modeling.


Jim Kowalkowski
Computing Division
Fermi National Accelerator Laboratory


Title: Enhancing the Performance of HEP Reconstruction and Simulation Applications.

Abstract: The CDF, D0, and CMS collider-detector experiments at Fermilab and CERN each have complex C++ component-based software frameworks for processing particle collision data collected millions of times per second from their detectors. The frameworks allow composition of physics algorithm sequences. The purpose of a sequence is to deduce particle interactions that occurred in collisions, starting from a set of energy deposits within the detector. A single set of deposits and derived products is held in a physics collision event, making event processing the fundamental unit of work in these frameworks. Since billions of events are processed for years over hundreds of machines, even modest performance increases are valuable. The millions of lines of algorithmic component code developed by hundreds of physicists for these frameworks are expected to be used for more than a decade. The event processing applications have complex run-time and build environments, long start-up times, large amounts of active code, and use lots of memory. The event execution time depends on the number of interacting particles contained within it, and different code can be run depending on event content. In addition, many of the components use common utility libraries. All of these factors must be taken into account when measuring and attempting to improve the performance of these applications. The main emphasis of this talk will be to explain the nature of the applications, show how we locate the root cause of performance problems and help make repairs, show how we verify results in light of inadequate testing, and explain how we report on improvement gains.

Bio: Jim is a senior software engineer at Fermi National Accelerator Laboratory and has been there for the last nine years in the Computing Division, working with collider-detector experiments and simulation groups. His specialty has been large-scale C++ development, with a focus on helping scientists design and implement code using object-oriented and generic programming techniques, and helping improve the performance of their code. More recently he has been helping establish, set direction for, and coordinate collaborative projects with groups outside the laboratory.


Dr. Jesus Labarta


CEPBA Director
European Center for Parallelism of Barcelona
Technical University of Catalonia


Title: Scalability of trace based tools

Abstract: Although traditionally criticized as not scalable, trace visualization approaches do have the potential to convey to the analyst a large amount of information that is often discarded in global profile approaches. The talk will describe the philosophy, techniques and examples of how the CEPBA-tools environment deals with traces of several GB and up to several thousand processes.

Bio: Jesus Labarta is a Professor in the Computer Architecture Department at the Technical University of Catalonia (UPC) and Director of the Computer Science research department in the Barcelona Supercomputing Center (BSC). He has a broad research interest in all issues related to supercomputer design, from processor architecture and interconnects to operating systems, programming models and performance analysis.


Dr. Allen D. Malony

Dept. Computer and Information Science
University of Oregon


Title: Knowledge-based Parallel Performance Data Mining

Abstract: The general goal of any performance analysis tool is to provide the user with a better understanding of performance phenomena captured in experiment measurements. As high-end parallel computer systems and the applications that run on them grow in size and complexity, performance analysis faces challenges of large, high-dimensionality data sets and reasoning about complex performance features. The use of data mining techniques can help by discovering properties in performance measurements and exposing data relationships in more meaningful ways. For example, our PerfExplorer performance data mining system has demonstrated advantages of multi-experiment analysis, dimensionality reduction, clustering, and correlation analysis on large scientific applications. While our work in performance data mining improved analysis automation and allowed access to powerful statistical methods, the results generated only re-described the measurements; they did not "explain" the performance observed. In general, what can be understood about performance is only as good as the data measured AND what is known about it (i.e., what the data means). In order to better interpret performance analysis results, guide performance diagnosis, and conduct performance meta-analysis, information beyond the raw performance data must be incorporated. Context metadata and other sources of performance knowledge must be supported in and used by the performance data mining process for productive performance analytics. We have re-engineered our performance data mining framework to incorporate parallel performance data, performance context metadata, expert parallel systems knowledge, and intermediate analysis results in performance analysis and meta-analysis. New methods have been implemented for encoding context metadata and expert knowledge in the performance database and correlating this information with the data mining routines.
Knowledge about hardware configurations, libraries, components, input data, algorithmic choices, runtime configurations, compiler choices, and code changes will augment direct performance measurements to make additional analyses possible. Our new framework implements an inference engine to encode and process the expert knowledge, provides user-configurable, scripted control over the process, and includes persistent storage of all intermediate results and analysis provenance. The framework also provides ways to interface with application developers in the performance discovery process. The ability to engage in process programming, knowledge engineering (metadata and inference rules), and results management opens the framework toolset for creating data mining environments specific to the developer's concerns.

Bio: Dr. Allen D. Malony is a Professor in the Department of Computer and Information Science at the University of Oregon. His research interests include performance evaluation and tools for large-scale parallel systems. The TAU Performance System is developed by Malony's research team. He earned a Ph.D. from the University of Illinois, Urbana-Champaign in 1990. Dr. Malony has received an NSF NYI award and an Alexander von Humboldt research award. He has also been a Fulbright Research Scholar to The Netherlands and Austria. Dr. Malony is Director and CEO of ParaTools, Inc.


Dr. Fabrizio Petrini

Cell Solutions Department
IBM TJ Watson Research Center


Title: Exact String Matching on the Cell/B.E. Platform

Abstract: String searching is the computationally intensive kernel of many security and network applications like search engines, intrusion detection systems, virus scanners and spam filters. The growing size of on-line content and the increasing wire speeds push the need for fast, and often real-time, string searching solutions. Multi-core processors are gaining increasing popularity, thanks to their unprecedented computing power, but they are also bringing new programming challenges. This study describes a class of high-performance exact string searching solutions that we have optimized for the IBM Cell/B.E. processor using the well-known Aho-Corasick algorithm. This class provides several trade-offs between performance and dictionary size. When dictionaries are small enough to fit in the local memories of the processing cores, the throughput reaches 40 Gbps per processor. With larger dictionaries (as many as hundreds of thousands of patterns), the typical throughput is between 1.6 and 2.2 Gbps per processor.

Bio: Fabrizio Petrini is a senior researcher in the Cell Solutions Department of the IBM TJ Watson Research Center. His research interests include various aspects of multi-core processors and supercomputers, including high-performance interconnection networks and network interfaces, fault tolerance, job scheduling algorithms, parallel architectures, operating systems, and parallel programming languages. He has received numerous awards from the US Department of Energy (DOE) for contributions to supercomputing projects, and from other organizations for scientific publications. He is a member of the IEEE.


Dr. Patrick H. Worley

Oak Ridge National Laboratory


Title: Performance Analysis in a Time of Development

Abstract: Performance analysis is often performed as part of the development cycle. However, long-term development targeting petascale systems, as typified by many of the Department of Energy SciDAC science application projects, can have distinctive performance analysis needs. This presentation describes some of the issues and proposed solutions that have arisen in recent work with climate and fusion SciDAC application projects on the Cray XT4 system at Oak Ridge National Lab. The presentation includes descriptions of effective experimental methodology, appropriate instrumentation, and the importance of performance tracking. The presentation concludes with a discussion of infrastructure that would improve the efficacy of the current process, such as effective utilization of performance data archives, performance model assertions, and performance query functions for common libraries such as MPI.

Bio: Dr. Patrick H. Worley is a senior research computer scientist in the Computer Science and Mathematics Division of Oak Ridge National Laboratory. He earned his Ph.D. in Computer Science from Stanford University in 1988. His research interests include parallel algorithm design and implementation, the performance evaluation of high performance computing systems, and the performance evaluation and optimization of parallel scientific applications.