Matching Application Signatures for Performance
Predictions using a Single Execution
Anirudh Jayakumar, Prakash Murali, Sathish Vadhiyar
Supercomputer Education and Research Centre
Indian Institute of Science
Bangalore, India
jayakumar.anirudh@gmail.com, sercprakash@ssl.serc.iisc.in, vss@serc.iisc.in
Abstract—Performance predictions for large problem sizes
and processor counts using limited small-scale runs are useful for a
variety of purposes, including scalability projections, and help
minimize the time taken to construct training data
for building performance models. In this paper, we present
a prediction framework that matches execution signatures for
performance predictions of HPC applications using a single small
scale application execution. Our framework extracts execution
signatures of applications and automatically identifies the
different application phases. Application signatures of
the different phases are matched with the execution profiles of
reference kernels stored in a kernel database. The performance of
the reference kernels is then used to predict the performance of
the application phases. For phases that do not match significantly,
our framework performs static analysis of loops and functions
in the application to provide prediction ranges. We demonstrate
this integrated set of techniques in our framework with three
large-scale applications: GTC, a Particle-in-Cell code
for turbulence simulation; Sweep3d, a 3D neutron transport
application; and SMG2000, a multigrid solver. We show that our
prediction ranges are accurate in most cases.
Keywords—Modeling; Prediction; Matching Application Signa-
tures; Kernels; Phase Identification;
I. INTRODUCTION
Performance characterization and predictions of parallel
applications are essential and have long been used for various
purposes including scalability studies [1], identifying perfor-
mance bottlenecks [2], projections for future systems [3] and
tuning applications and algorithms [4]. A common approach
for performance prediction is to execute or benchmark the
application for different numbers of processors and problem sizes,
observe the execution profiles including execution times, and
employ curve-fitting and machine learning techniques to map
the observed execution profiles to a performance model [5]–
[7]. The performance model can then be used for predicting
performance for a new problem size and number of processors.
In many of the existing strategies, a significant number of benchmarks
are performed under controlled conditions to obtain
performance predictions with reasonable accuracy, resulting in
long training times for the model.
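
As a minimal sketch of the curve-fitting approach described above, the following fits a performance model to a handful of benchmark runs by least squares. It assumes a fixed problem size and a hypothetical model form T(p) = a + b/p (a non-scaling part plus a perfectly divisible part). The model form, the function names and the timing values are illustrative assumptions and are not taken from the cited works.

```python
def fit_time_model(runs):
    """Least-squares fit of the assumed model T(p) = a + b / p.

    runs: list of (number_of_processors, measured_time_seconds) pairs.
    Returns the fitted coefficients (a, b).
    """
    xs = [1.0 / p for p, _ in runs]   # regress time against 1/p
    ts = [t for _, t in runs]
    n = len(runs)
    mean_x = sum(xs) / n
    mean_t = sum(ts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts))
    b = sxt / sxx
    a = mean_t - b * mean_x
    return a, b


def predict_time(model, p):
    """Evaluate the fitted model for a new processor count p."""
    a, b = model
    return a + b / p


# Made-up benchmark timings for illustration only.
runs = [(2, 52.0), (4, 27.0), (8, 14.5), (16, 8.25)]
model = fit_time_model(runs)
print(predict_time(model, 64))
```

Each additional (processors, time) pair in `runs` corresponds to one more benchmark execution, which is why the training cost of such models grows with the accuracy demanded of them.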
Limiting the number of benchmarks needed for building
the performance models for predictions will help minimize
the time taken for performing the benchmarking experiments
and the modeling process. Moreover, in certain constrained
environments, the benchmark runs and results are implicitly
limited. For example, in some large supercomputer systems,
application developers execute their applications with small
problem sizes on a small number of processors in a special queue
called the debug queue for development, tuning and debugging
purposes before performing large-scale production runs on
a large number of processors in production queues. The debug
runs are limited and are performed for a very small number
of problem and system size configurations. A performance
modeling system for predicting performance of production
runs will have to be built using the limited debug runs.
In this work, we have developed a prediction framework
for performance predictions of HPC applications using a single
small scale application execution. Our framework employs a
novel strategy of matching execution profiles of the differ-
ent phases of the parallel applications to parallel reference
kernels stored in a kernel database. The reference kernels
are standard benchmarks from diverse application domains
as prescribed by Colella’s seven dwarfs [8], Berkeley View’s
thirteen motifs [9], and the TORCH testbed of computational
reference kernels [10]. Our framework provides a suite, RK-
suite, of implementations, execution profiles and performance
models of reference kernels. Specifically, RK-suite consists
of (1) a collection, RK-collection, of these reference kernel
implementations, (2) execution profiles, RK-profile, including
cache hits and misses, instruction mix, etc., obtained from
benchmarking runs of the reference kernels for a finite set of
problem sizes and numbers of processors, and (3) a performance
model, RK-model, that can be used to predict execution times
of the kernel implementations for other problem sizes and
numbers of processors. We claim that such an RK-suite can be useful for
a number of purposes including evaluations and comparisons
of the high performance computing systems by supercomputer
installations, and hardware and software tuning by the vendors.
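
The structure of a kernel database entry and the matching of a phase against it can be sketched as follows. This is only an illustration under stated assumptions: the counter names, the use of unit-length normalization with cosine similarity, the `threshold` value, and all numbers are hypothetical, since this section does not specify how profiles are normalized or compared.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Profile = Dict[str, float]  # counter name -> measured value


@dataclass
class ReferenceKernel:
    """One hypothetical kernel database entry."""
    name: str                                   # kernel in the collection
    profiles: Dict[Tuple[int, int], Profile]    # (problem size, procs) -> profile
    model: Callable[[int, int], float]          # predicts execution time


def normalize(profile: Profile, keys: List[str]) -> List[float]:
    """Scale a profile to unit length (assumed normalization)."""
    vec = [profile.get(k, 0.0) for k in keys]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def best_match(phase: Profile, kernels: List[ReferenceKernel],
               keys: List[str], threshold: float = 0.9
               ) -> Tuple[Optional[str], float]:
    """Return the closest kernel name, or None if no score reaches threshold."""
    phase_vec = normalize(phase, keys)
    best_name, best_score = None, -1.0
    for kernel in kernels:
        for prof in kernel.profiles.values():
            ref_vec = normalize(prof, keys)
            score = sum(x * y for x, y in zip(phase_vec, ref_vec))  # cosine
            if score > best_score:
                best_name, best_score = kernel.name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


# Made-up counters and models for illustration only.
KEYS = ["cache_hits", "cache_misses", "fp_ops", "loads"]
KERNELS = [
    ReferenceKernel("fft", {(64, 4): {"cache_hits": 900, "cache_misses": 100,
                                      "fp_ops": 500, "loads": 400}},
                    lambda n, p: 1.0 + n / p),
    ReferenceKernel("stencil", {(64, 4): {"cache_hits": 500, "cache_misses": 500,
                                          "fp_ops": 200, "loads": 800}},
                    lambda n, p: 2.0 + n / p),
]
phase = {"cache_hits": 1800, "cache_misses": 210, "fp_ops": 990, "loads": 820}
print(best_match(phase, KERNELS, KEYS))
```

A phase for which `best_match` returns `None` corresponds to the case of a phase that does not match any kernel significantly and would need a different prediction route.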
For a given application executed with a small problem size
and number of processors, we collect the execution profiles
or execution signatures of the application, automatically detect
the significant phases of the application, and match the
normalized execution profiles of the phases against those of the
reference kernels. For example, one of the reference kernels in our
RK-collection is a parallel FFT implemented in the NAS Parallel
Benchmark (NPB) [11]. The FFT implementation is executed
with different problem sizes and numbers of processors, and
the execution profiles, RK-profile, are collected for these runs.
The execution times are predicted for other problem sizes and
numbers of processors using RK-model. An application like the
Community Earth System Model (CESM) [12] can include
FFT calculations as one of its phases. The normalized
2015 IEEE 29th International Parallel and Distributed Processing Symposium
1530-2075/15 $31.00 © 2015 IEEE
DOI 10.1109/IPDPS.2015.20