Matching Application Signatures for Performance Predictions using a Single Execution

Anirudh Jayakumar, Prakash Murali, Sathish Vadhiyar
Supercomputer Education and Research Centre
Indian Institute of Science
Bangalore, India
jayakumar.anirudh@gmail.com, sercprakash@ssl.serc.iisc.in, vss@serc.iisc.in

Abstract—Performance predictions for large problem sizes and processor counts using a limited number of small scale runs are useful for a variety of purposes, including scalability projections, and help minimize the time taken to construct training data for building performance models. In this paper, we present a prediction framework that matches execution signatures for performance predictions of HPC applications using a single small scale application execution. Our framework extracts execution signatures of applications and automatically identifies the different application phases. Application signatures of the different phases are matched with the execution profiles of reference kernels stored in a kernel database. The performance of the reference kernels is then used to predict the performance of the application phases. For phases that do not match significantly, our framework performs static analysis of loops and functions in the application to provide prediction ranges. We demonstrate this integrated set of techniques in our framework with three large scale applications: GTC, a Particle-in-Cell code for turbulence simulation; Sweep3d, a 3D neutron transport application; and SMG2000, a multigrid solver. We show that our prediction ranges are accurate in most cases.

Keywords—Modeling; Prediction; Matching Application Signatures; Kernels; Phase Identification

I. INTRODUCTION

Performance characterization and predictions of parallel applications are essential and have long been used for various purposes, including scalability studies [1], identifying performance bottlenecks [2], projections for future systems [3], and tuning applications and algorithms [4]. A common approach for performance prediction is to execute or benchmark the application for different numbers of processors and problem sizes, observe the execution profiles including execution times, and employ curve-fitting and machine learning techniques to map the observed execution profiles to a performance model [5]–[7]. The performance model can then be used for predicting performance for a new problem size and number of processors. In many of the existing strategies, a significant number of benchmarks are performed under controlled conditions to obtain performance predictions with reasonable accuracy, resulting in long training times for the model.

Limiting the number of benchmarks needed for building the performance models will help minimize the time taken for the benchmarking experiments and the modeling process. Moreover, in certain constrained environments, the benchmark runs and results are implicitly limited. For example, in some large supercomputer systems, application developers execute their applications with small problem sizes on a small number of processors in a special queue, called the debug queue, for development, tuning and debugging purposes before performing large scale production runs on a large number of processors in the production queues. The debug runs are limited and are performed for a very small number of problem and system size configurations. A performance modeling system for predicting the performance of production runs will have to be built using these limited debug runs.

In this work, we have developed a prediction framework for performance predictions of HPC applications using a single small scale application execution.
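For contrast, the conventional multi-run approach described earlier in this section can be illustrated with a minimal sketch. It assumes a deliberately simple scaling model, T(p) = a + b/p, fitted by ordinary least squares. The model form, the function names `fit_scaling_model` and `predict_time`, and the timings are illustrative assumptions and are not taken from the cited works. The sketch also shows why such a fit cannot be built from a single run: with one observation the two parameters are not determined.

```python
def fit_scaling_model(procs, times):
    """Least-squares fit of the assumed model T(p) = a + b / p.

    procs: processor counts of the benchmark runs
    times: measured execution times (seconds) for those runs
    Returns the pair (a, b).
    """
    if len(procs) != len(times) or len(procs) < 2:
        raise ValueError("need at least two (processors, time) pairs")
    xs = [1.0 / p for p in procs]          # regress time on x = 1/p
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need runs at two or more distinct processor counts")
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, times))
    b = sxt / sxx                          # slope: parallelizable work
    a = mean_t - b * mean_x                # intercept: non-scaling cost
    return a, b


def predict_time(model, p):
    """Evaluate the fitted model at a new processor count p."""
    a, b = model
    return a + b / p


# Hypothetical timings, generated from T(p) = 2 + 64 / p for illustration.
procs = [4, 8, 16, 32]
times = [18.0, 10.0, 6.0, 4.0]
model = fit_scaling_model(procs, times)
print("predicted time on 256 processors:", predict_time(model, 256))
```

Calling `fit_scaling_model([4], [18.0])` raises `ValueError`, which mirrors the limitation that motivates a single-execution approach.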
Our framework employs a novel strategy of matching execution profiles of the different phases of the parallel applications to parallel reference kernels stored in a kernel database. The reference kernels are standard benchmarks from diverse application domains as prescribed by Colella’s seven dwarfs [8], Berkeley View’s thirteen motifs [9], and the TORCH testbed of computational reference kernels [10]. Our framework provides a suite, RK-suite, of implementations, execution profiles and performance models of reference kernels. Specifically, RK-suite consists of (1) a collection, RK-collection, of these reference kernel implementations, (2) execution profiles, RK-profile, including cache hits and misses, instruction mix, etc., obtained using benchmarking runs of the reference kernels for a finite set of problem sizes and numbers of processors, and (3) a performance model, RK-model, that can be used to predict execution times of the kernel implementations for other problem sizes and numbers of processors. We claim that such an RK-suite can be useful for a number of purposes, including evaluations and comparisons of high performance computing systems by supercomputer installations, and hardware and software tuning by vendors.

For a given application executed with a small problem size and number of processors, we collect the execution profiles, or execution signatures, of the application, automatically detect the significant phases of the application, and match the normalized execution profiles of the phases against those of the reference kernels. For example, one of the reference kernels in our RK-collection is a parallel FFT implemented in the NAS Parallel Benchmarks (NPB) [11]. The FFT implementation is executed with different problem sizes and numbers of processors, and the execution profiles, RK-profile, are collected for these runs. The execution times are predicted for other problem sizes and numbers of processors using RK-model.
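The matching idea described above can be sketched as follows. This is only an illustration under stated assumptions, not the framework's actual procedure: the `ReferenceKernel` record, the counter names, normalization by instruction count, cosine similarity as the matching metric, the 0.98 threshold, and the two toy kernel models are all hypothetical choices made for this example.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ReferenceKernel:
    """One hypothetical kernel database entry."""
    name: str
    profile: Dict[str, float]             # normalized counters (an RK-profile stand-in)
    model: Callable[[int, int], float]    # (problem_size, procs) -> seconds (an RK-model stand-in)


def normalize(counters: Dict[str, float], total_key: str = "instructions") -> Dict[str, float]:
    """Express each raw counter as a rate per instruction (assumed normalization)."""
    total = counters[total_key]
    return {k: v / total for k, v in counters.items() if k != total_key}


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity over the counters that both profiles share."""
    keys = sorted(set(a) & set(b))
    dot = sum(a[k] * b[k] for k in keys)
    norm_a = math.sqrt(sum(a[k] ** 2 for k in keys))
    norm_b = math.sqrt(sum(b[k] ** 2 for k in keys))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(phase: Dict[str, float], kernels: List[ReferenceKernel],
               threshold: float = 0.98) -> Optional[ReferenceKernel]:
    """Return the most similar kernel, or None if no kernel is similar enough."""
    best = max(kernels, key=lambda k: cosine_similarity(phase, k.profile), default=None)
    if best is None or cosine_similarity(phase, best.profile) < threshold:
        return None
    return best


# Hypothetical kernel database with made-up profiles and toy models.
kernels = [
    ReferenceKernel("fft",
                    {"l1_miss": 0.02, "l2_miss": 0.005, "fp_ops": 0.40, "branches": 0.05},
                    lambda n, p: 1e-8 * n * math.log2(n) / p),
    ReferenceKernel("stencil",
                    {"l1_miss": 0.08, "l2_miss": 0.03, "fp_ops": 0.25, "branches": 0.10},
                    lambda n, p: 5e-8 * n / p),
]

# Hypothetical raw counters for one application phase from a single small run.
phase = normalize({"instructions": 1e9, "l1_miss": 2.1e7, "l2_miss": 4.8e6,
                   "fp_ops": 3.9e8, "branches": 5.2e7})

match = best_match(phase, kernels)
if match is not None:
    print("matched kernel:", match.name)
    print("predicted seconds:", match.model(2 ** 20, 64))
else:
    print("no significant match")
```

Returning `None` for a weak match leaves room for a fallback; in the framework described here, phases that do not match significantly are handled by static analysis.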
An application such as the Community Earth System Model (CESM) [12] can include FFT calculations as one of its phases. The normalized 2015 IEEE 29th International Parallel and Distributed Processing Symposium 1530-2075/15 $31.00 © 2015 IEEE DOI 10.1109/IPDPS.2015.20 1161