Clustering Performance Data Efficiently at Massive Scales

Todd Gamblin, Bronis R. de Supinski, Martin Schulz, Rob Fowler, and Daniel A. Reed

Lawrence Livermore National Laboratory, 7000 East Avenue, Livermore, CA 94550, USA
Renaissance Computing Institute, University of North Carolina, Chapel Hill, NC 27599, USA
Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA
{tgamblin, bronis, schulzm}@llnl.gov, rjf@renci.org, daniel.reed@microsoft.com

ABSTRACT

Existing supercomputers have hundreds of thousands of processor cores, and future systems may have hundreds of millions. Developers need detailed performance measurements to tune their applications and to exploit these systems fully. However, extreme scales pose unique challenges for performance-tuning tools, which can generate significant volumes of I/O. Compute-to-I/O ratios have increased drastically as systems have grown, and the I/O systems of large machines can handle the peak load from only a small fraction of cores. Tool developers need efficient techniques to analyze and to reduce performance data from large numbers of cores.

We introduce CAPEK, a novel parallel clustering algorithm that enables in-situ analysis of performance data at run time. Our algorithm scales sub-linearly to 131,072 processes, running in less than one second even at that scale, which is fast enough for on-line use in production runs. The CAPEK implementation is fully generic and can be used for many types of analysis. We demonstrate its application to statistical trace sampling. Specifically, we use our algorithm to compute efficiently stratified sampling strategies for traces at run time. We show that such stratification can result in data-volume reduction of up to four orders of magnitude on current large-scale systems, with potential for greater reductions for future extreme-scale systems.
Categories and Subject Descriptors

C.4 [Performance of Systems]: Measurement Techniques; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Clustering

General Terms

Algorithms, Measurement, Performance

This work was performed under the auspices of the U.S. Department of Energy, supported by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344 (LLNL-CONF-422684). To appear in ICS'10. Draft typeset on April 20, 2010.

© 2010 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by a contractor or affiliate of the U.S. Government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. ICS'10, June 2–4, 2010, Tsukuba, Ibaraki, Japan. Copyright 2010 ACM 978-1-4503-0018-6/10/06 ...$10.00.

1. INTRODUCTION

Scalability is a key challenge in designing parallel performance tools. Petascale supercomputers have over 100,000 cores, and exascale systems are expected to have hundreds of millions of cores [1, 11, 21]. Thread counts on these systems may reach into the billions.

Programmers need detailed performance information to analyze and to tune application performance, but collecting this information can be difficult. For example, load balance is critical for performance at scale. Many scientific codes use adaptive techniques in which per-process load depends on application data. In large physical simulations, particles or other model components may move between processes, and load distributions may evolve over time. Some applications employ adaptive load-balancing techniques to mitigate these problems, but understanding how load-balancing schemes perform and diagnosing their limitations requires generating load traces across processes and over time.
The size of a performance trace collected from such an application scales linearly with the process count and the simulation length.

While performance data sizes continue to grow, the I/O-to-compute ratios of modern large-scale machines have shrunk. Further, concurrent output from all processes leads to unacceptable performance, and application developers must frequently devise novel I/O strategies to limit contention for shared file-system resources. These limits hamper tool developers even more, because tools compete for resources against the applications that they measure. Too much output from a tool can severely perturb the application, invalidating any measurements it makes.

An ideal performance tool would collect only measurements pertinent to an application's performance. However, tool developers often do not know a priori which data are pertinent. Typical trace tools collect per-timestep, per-process data for off-line analysis. As core counts increase, we must perform some or all of this analysis on-line to avoid saturating the I/O system.

This paper's primary contribution is CAPEK, an algorithm for scalable parallel cluster analysis. We use CAPEK to identify processes with similar performance behavior and to group them into equivalence classes. Knowing performance equivalence classes allows tool developers to focus on a representative subset of processes, which reduces data volume considerably. We show that CAPEK scales to 131,072 processes on an IBM BlueGene/P system, and on this system it exhibits a run time suitable for on-line use in production runs.

Our implementation of CAPEK is generic, using C++ templates suitable for use on arbitrary data types and for arbitrary distance metrics. We demonstrate its utility by clustering performance trace