MIAMI: A Framework for Application Performance Diagnosis

Gabriel Marin, Innovative Computing Laboratory, University of Tennessee, gmarin@utk.edu
Jack Dongarra, Innovative Computing Laboratory, University of Tennessee, dongarra@utk.edu
Dan Terpstra, Innovative Computing Laboratory, University of Tennessee, terpstra@utk.edu

Abstract—A typical application tuning cycle repeats the following three steps in a loop: performance measurement, analysis of results, and code refactoring. While performance measurement is well covered by existing tools, analysis of results to understand the main sources of inefficiency and to identify opportunities for optimization is generally left to the user. Today's state-of-the-art performance analysis tools use instrumentation or hardware counter sampling to measure the performance of interactions between code and the target architecture during execution. Such measurements are useful for identifying hotspots in applications: places where execution time is spent or where cache misses are incurred. However, an explanatory understanding of tuning opportunities requires a more detailed, mechanistic modeling approach. This paper presents MIAMI (Machine Independent Application Models for performance Insight), a set of tools for automatic performance diagnosis. MIAMI uses application characterization and models of target architectures to reason about an application's performance. MIAMI uses a modeling approach based on first-order principles to identify performance bottlenecks, pinpoint optimization opportunities, and compute bounds on the potential for improvement.

I. INTRODUCTION

Investments in high performance computing (HPC) systems stand at tens of millions of dollars each year. These systems have tremendous peak performance potential, as demonstrated by their throughput results with highly optimized, dense linear algebra kernels [23]. However, most scientific simulations run at only a fraction of theoretical system peak speed.
This large unfulfilled performance potential is due in part to compilers and application developers being unable to harness the potential of the architectures, and in part to an imbalance between the resources offered by current systems and the actual needs of applications. To close this performance gap, application developers must precisely understand what factors are limiting the performance of their codes, a process known as performance diagnosis. Performance diagnosis is the first step, and at the same time, the most difficult step of any performance optimization effort, just as understanding the causes behind a program crashing or producing incorrect results is the most important and the most difficult step of any program debugging effort. Once we identify the factors that limit performance, the code transformations required to alleviate the detected performance bottlenecks become more easily apparent.

Performance analysis tools in use today rely either on caliper-based hardware counter measurements [3], [21] or on hardware counter sampling [1], [10], [20] to measure application performance during execution. A strength of hardware performance counters is that they can observe phenomena that cannot be measured directly otherwise. However, hardware counters can only observe performance effects, the result of interactions between code and target architecture. Root cause analysis from hardware counter measurements requires a process of deconvolution through which parts of the observed effects are attributed to specific application and architectural factors. While certain correlations between application or architectural factors and the observed performance effects can be established, the process requires high levels of user expertise and a significant amount of guesswork.
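To make the deconvolution idea concrete, consider a toy illustration (ours, not part of MIAMI, with entirely hypothetical event names and numbers): raw counters report only aggregate effects, so attributing cost to causes amounts to solving an inverse problem. Here observed cycles are modeled as a linear combination of event counts with unknown per-event penalties, and two measurement runs suffice to recover the two penalties:

```python
# Toy counter-cost attribution (illustrative only; measurements are made up).
# Model: cycles = miss_penalty * L2_misses + mispred_penalty * branch_mispredicts.
# Two runs give a 2x2 linear system in the unknown per-event penalties.

# Hypothetical measurements: (L2 misses, branch mispredictions, total cycles)
runs = [
    (1_000_000, 200_000, 304_000_000),
    (  400_000, 500_000, 130_000_000),
]

# Solve  a*misses + b*mispredicts = cycles  by Cramer's rule.
(m1, b1, c1), (m2, b2, c2) = runs
det = m1 * b2 - m2 * b1
miss_penalty = (c1 * b2 - c2 * b1) / det
mispred_penalty = (m1 * c2 - m2 * c1) / det

print(f"estimated cycles per L2 miss:       {miss_penalty:.1f}")
print(f"estimated cycles per misprediction: {mispred_penalty:.1f}")
```

In practice the system is noisy, overdetermined, and far from linear, which is precisely why naive correlation requires so much expertise and guesswork.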
To provide the kind of feedback that we think is necessary, tools must identify and model in isolation the application and architectural factors that are important for performance, e.g., the application instruction mix, the instruction schedule dependencies, the type of resources available on a target architecture, and the type of resources required by each basic operation during execution. They must also understand how data is reused and the patterns with which an application traverses memory. Performance diagnosis tools must then use a performance convolution process based on first-order principles to understand the factors that limit performance at each point in an application. An estimate of the maximum potential for improvement can be computed by idealizing the limiting factors and reapplying the convolution process. Finally, understanding what factors limit performance directly determines the type of code or architectural changes needed to alleviate each bottleneck. In some instances, such transformations are not possible, or they are prohibitively expensive. Even so, it is very useful for a user to identify such situations, so as to understand when to stop optimizing. Providing users with an accurate trade-off of costs, i.e., the type of transformations required, and benefits, e.g., the potential for performance gains, enables them to make informed decisions about where to focus their tuning efforts.

To be useful, tools must automate as much of this process as possible. They must work on full applications instead of requiring users to extract “interesting kernels,” and they must be able to handle interactions between application code and system libraries. Because performance also depends on the quality of the code produced by the compiler, tools should try to observe the effect of optimizations while not perturbing the optimization process.
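The idealization step can be sketched with a toy first-order bound model (our own illustration with hypothetical resources and rates, not MIAMI's actual machinery): each resource's time is its demand divided by its throughput, the predicted execution time is the maximum over resources, and idealizing the limiting resource and reapplying the same bound yields the maximum potential for improvement:

```python
# Toy first-order bottleneck model (illustrative; demands and rates are made up).
# Predicted time is bounded by the most contended resource.

def predicted_time(demand, throughput):
    """Bottleneck bound: the slowest resource dictates execution time."""
    return max(demand[r] / throughput[r] for r in demand)

# Hypothetical loop nest: 8e9 flops, 6e9 bytes of memory traffic,
# on a machine sustaining 4 Gflop/s and 1 GB/s.
demand = {"flops": 8e9, "mem_bytes": 6e9}
throughput = {"flops": 4e9, "mem_bytes": 1e9}

base = predicted_time(demand, throughput)  # memory time dominates: 6.0 s

# Idealize the limiting factor (perfect reuse -> no memory traffic),
# then reapply the same convolution to bound the potential gain.
ideal_demand = dict(demand, mem_bytes=0.0)
bound = predicted_time(ideal_demand, throughput)  # compute bound: 2.0 s

print(f"predicted time: {base:.1f}s; bound after idealizing memory: {bound:.1f}s")
print(f"maximum speedup from memory optimizations: {base / bound:.1f}x")
```

Even in this caricature, the bound is informative: no amount of memory tuning can improve this loop by more than 3x, which tells the user when to stop optimizing along that axis.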
For these reasons, we think that the best way to perform performance diagnosis is by analyzing application executables. In addition, tools that work on binaries can naturally handle applications written in different programming languages or using different programming models.

In this paper, we present MIAMI (Machine Independent Application Models for performance Insight), a set of extensible tools for automatic performance diagnosis. MIAMI analyzes fully optimized x86 application binaries to construct a machine-independent understanding of an application's al-