A Machine Learning Framework for Performance Coverage Analysis of Proxy Applications

Tanzima Z. Islam, Jayaraman J. Thiagarajan, Abhinav Bhatele, Martin Schulz, Todd Gamblin
Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, California 94551
E-mail: {tanzima, jayaramanthi1, bhatele, schulzm, tgamblin}@llnl.gov

Abstract—Proxy applications are written to represent subsets of the performance behaviors of larger, more complex applications that often have distribution restrictions. They enable easy evaluation of these behaviors across systems, e.g., for procurement or co-design purposes. However, the intended correlation between the performance behaviors of proxy applications and their parent codes is often based solely on the developer’s intuition. In this paper, we present novel machine learning techniques to methodically quantify the coverage of performance behaviors of parent codes by their proxy applications. We have developed a framework, VERITAS, to answer two questions in the context of on-node performance: (a) which hardware resources are covered by a proxy application, and how well, and (b) which resources are important but not covered. We present our techniques in the context of two benchmarks, STREAM and DGEMM, and two production applications, OpenMC and CMTnek, and their respective proxy applications.

Index Terms—Machine learning, Unsupervised learning, Performance analysis, Scalability

I. INTRODUCTION

As we move towards exascale, it has become important to take application characteristics into account when designing the hardware architecture, and vice versa. This approach requires an agile co-design loop with hardware architects, system software developers, and domain scientists working together to make informed decisions about features and tradeoffs in the design of applications, algorithms, the underlying system software, and hardware.
Ideally, all co-design efforts would be driven by performance measurements based directly on the targeted production applications. However, those applications are often too large or too complex to set up for early design studies, or have distribution restrictions. On the other hand, traditional benchmark suites such as NAS [1] cover a number of popular parallel algorithms but do not include modern adaptive methods such as Monte Carlo or discrete ordinates. This has led to the widespread development of proxy applications to better characterize the performance of complex production applications [2], [3], [4], [5].

Proxy applications are typically developed to capture specific performance-critical modules in a production application. By retaining the parent application’s behavior, they offer the convenience and flexibility to analyze performance without requiring the time, effort, and expertise needed to port or modify production codes. Typically, the proxy emulates certain computational aspects of the parent, such as specific implementations of an algorithm, or performance aspects, such as memory access patterns. Proxy applications are comparatively small in size, easy to understand, and typically publicly available for the community to use. For example, XSBENCH is a proxy application for OPENMC that implements a number of random indirect array lookup operations to compute neutron cross-sections on a one-dimensional array [3]. However, the corresponding kernel in OPENMC implements a number of additional operations to compute these indices [6]. This poses the following questions: (a) do the two applications have the same performance behavior on a target architecture? and (b) which performance behaviors differ between them?

Given the important role proxy applications play in the co-design process, it is critical to understand which salient performance characteristics of a parent application are covered by a proxy, and how well.
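The memory-access pattern that XSBENCH emulates can be sketched in a few lines. This is only an illustrative Python sketch of random indirect lookups into a one-dimensional table; the actual proxy is written in C with OpenMP, and the index computation in OPENMC involves additional operations, as noted above.

```python
import numpy as np

def xs_lookup_kernel(xs_table: np.ndarray, n_lookups: int, seed: int = 0) -> float:
    """Random indirect lookups into a 1-D cross-section table.

    Each iteration reads a random, data-dependent location, so the
    kernel stresses memory latency rather than arithmetic throughput.
    """
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, xs_table.size, size=n_lookups)
    return float(xs_table[idx].sum())
```

Because the lookup indices are data-dependent, hardware prefetchers gain little, which is precisely the behavior such a proxy is meant to expose.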
Unfortunately, the notion of “coverage” is currently highly subjective, and strategies such as linear correlation and Principal Component Analysis (PCA) adopted for such comparative analysis are ineffective. To the best of our knowledge, this paper is the first to present a principled machine learning approach that methodically quantifies the quality of a match.

We introduce novel machine learning techniques to identify important performance characteristics of applications and to quantify the coverage provided by a proxy application compared to its parent. Our approach adopts ideas from sparse learning theory to identify performance metrics that can describe the performance characteristics (e.g., efficiency loss) of an application. Further, we define two new metrics: (1) the Resource Significance Measure, which quantifies the significance of hardware resources in predicting application performance, computed by accumulating the beliefs from each of the constituent metrics in the learned sparse model; and (2) a Coverage metric, which indicates the quality of a match between the resource utilization behavior of a proxy and its parent application. Note that instead of aggregating pairwise correlations between the individual metrics, our approach constructs subspace models for both proxy and parent using all metrics corresponding to a hardware resource and estimates how well the models agree. In addition to being robust, this provides a principled way to compare multiple metrics simultaneously.

We implement these methodologies in VERITAS, a machine learning framework for comparative performance analysis. We focus on on-node performance behaviors of applications to tackle the increasing node complexity of current and upcoming systems. However, similar analysis could be applied to

SC16; Salt Lake City, Utah, USA; November 2016 978-1-4673-8815-3/16/$31.00 © 2016 IEEE
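The two ingredients described above can be illustrated generically. The sketch below is not VERITAS itself (whose exact formulation appears later in the paper): it shows (a) sparse metric selection via a standard L1-regularized regression, solved here with ISTA, and (b) subspace agreement measured through the principal angles between the column spaces spanned by two groups of metrics. All function names and parameters are illustrative.

```python
import numpy as np

def lasso_select(X, y, lam=0.1, iters=2000):
    """Sparse (L1-regularized) least-squares via ISTA.

    Metrics (columns of X) whose learned weight is driven to
    (near-)zero are deemed unimportant for explaining the
    performance target y (e.g., efficiency loss).
    """
    w = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1/L, L = Lipschitz constant
    for _ in range(iters):
        w -= step * (X.T @ (X @ w - y))                            # gradient step
        w = np.sign(w) * np.maximum(np.abs(w) - lam * step, 0.0)   # soft-threshold
    return w

def subspace_agreement(A, B):
    """Mean cosine of the principal angles between the column
    spaces of A and B (samples x metrics matrices).

    1.0 means the proxy's and parent's metric subspaces coincide;
    0.0 means they are orthogonal.
    """
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return float(np.mean(np.clip(s, 0.0, 1.0)))
```

Comparing subspaces in this way treats all metrics of a hardware resource jointly, unlike averaging pairwise correlations, which is the distinction the text draws.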