Evaluating Architecture and Compiler Design through Static Loop Analysis

Yuriy Kashnikov, Pablo de Oliveira Castro, Emmanuel Oseret, and William Jalby
Exascale Computing Research – University of Versailles, France
yuriy.kashnikov@exascale-computing.eu, pablo.oliveira@exascale-computing.eu, emmanuel.oseret@exascale-computing.eu, william.jalby@exascale-computing.eu

Abstract—Using the MAQAO static loop analyzer, we characterize a corpus of binary loops extracted from common benchmark suites, such as SPEC and NAS, and from several industrial applications. For each loop, MAQAO extracts low-level assembly features such as the integer and floating-point vectorization ratio, the number of registers used and spill-fill operations, and the number of concurrent memory streams accessed. The distributions of these features over a large representative code corpus can be used to evaluate compilers and architectures and to tune them for the most frequently used assembly patterns. In this paper, we present the MAQAO loop analyzer and a characterization of 4857 binary loops. We evaluate register allocation and vectorization on two compilers and propose a method to tune the loop buffer size and the stream prefetcher based on static analysis of benchmarks.

Index Terms—Benchmarking and Assessment; Software Monitoring and Measurement; HPC Monitoring and Instrumentation; Modeling, Simulation and Evaluation Techniques

I. INTRODUCTION

To be efficient on a large set of applications, new architectures and compilers must be tested on as many benchmarks as possible. The cost of benchmarking grows with the number of benchmarks considered, so testing a new architecture or compiler on thousands of benchmarks may not be feasible. In this paper, we propose to use fast static analysis as a preliminary step in the tuning process.

Compilers must be tuned to better harness new architectural features.
To efficiently generate code for a given architecture, compilers use an approximate model of the architecture and profitability heuristics, which guide the code transformations. Tuning these heuristics is usually tedious and error-prone manual work done by compiler developers. Our approach simplifies this process by identifying loops with sub-optimal code (e.g., badly vectorized or too big for the architecture's loop buffer). The proposed static analysis gives a quick estimate of the quality of the generated code and could therefore be transparently integrated into an automated compiler evaluation process. We demonstrate the applicability of the proposed approach for two compiler transformations: register allocation and vectorization.

For architectures, we consider the problem of selecting good hardware parameters such as the number of virtual registers, the loop buffer size, or the kind and number of dispatch ports. Finding the sweet spot for these parameters through benchmarking may require many iterations. The number of testing iterations could be reduced if an oracle provided good enough initial values. We propose a method to statically derive good initial candidates for hardware parameters.

First, we collect a body of 4857 binary loops extracted from a set of benchmarks and real industrial applications. Each loop is dissected with the MAQAO static loop analyzer [1], [2], which extracts a large set of low-level characteristics such as the number of registers used, the number of values read from the stack, the vectorization ratio, and the pressure on dispatch ports. MAQAO statistics are aggregated over all the loops in our database. To get a realistic profile, we weight loop features proportionally to their running time and discard loops that have little impact on the application running time. This analysis allows us to profile the assembly characteristics of many different code fragments.
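
As a rough illustration of the weighting step described above, the following Python sketch aggregates one static feature over a set of loops, weighting each loop by its share of running time and dropping loops below a cutoff. The `Loop` record, its field names, the helper functions, and the 1% cutoff are all assumptions made for this example; the text states only that low-impact loops are discarded and does not specify a threshold or MAQAO's data model.

```python
from dataclasses import dataclass


@dataclass
class Loop:
    """Hypothetical per-loop record: one static feature plus a run-time share."""
    name: str             # illustrative identifier, not a MAQAO field
    registers_used: int   # example static feature reported per loop
    time_share: float     # fraction of application running time spent in the loop


def keep_hot_loops(loops, min_share=0.01):
    """Discard loops whose run-time share is below an assumed cutoff (1% here)."""
    return [lp for lp in loops if lp.time_share >= min_share]


def weighted_distribution(loops):
    """Map each feature value to its normalized run-time weight."""
    total = sum(lp.time_share for lp in loops)
    if total <= 0:
        raise ValueError("no run time left after filtering")
    dist = {}
    for lp in loops:
        dist[lp.registers_used] = dist.get(lp.registers_used, 0.0) + lp.time_share / total
    # Sort by feature value so cumulative sums can be read off in order.
    return dict(sorted(dist.items()))


def weighted_quantile(dist, q):
    """Smallest feature value whose cumulative weight reaches q."""
    cumulative = 0.0
    for value, weight in dist.items():
        cumulative += weight
        if cumulative >= q - 1e-12:  # small tolerance for floating-point rounding
            return value
    return max(dist)


# Made-up sample data, for illustration only.
sample = [
    Loop("a", 8, 0.50),
    Loop("b", 12, 0.30),
    Loop("c", 16, 0.195),
    Loop("d", 30, 0.005),  # below the cutoff, so it is dropped
]
profile = weighted_distribution(keep_hot_loops(sample))
```

Reading a quantile from such a weighted distribution, for instance `weighted_quantile(profile, 0.9)`, is one possible way to turn the profile into a candidate value for a hardware parameter. Weighting by run time rather than counting loops keeps rarely executed loops from dominating the result.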
By using a very large set of benchmarks, we can identify the most frequent micro-architectural bottlenecks.

The empirical distributions of these assembly features are used to select good candidates for sizing architecture parameters. The distribution of registers used and spill-fill operations per loop can be used to evaluate the payoff (cost) of adding (removing) registers in an architecture. For example, the number of registers could be selected so that 90% of the loops are free of register spills. The port pressure distribution may help decide how many arithmetic, load, and store ports are needed to balance the dispatch rate. Of course, this decision must be weighed against the power, silicon, and complexity cost of adding new dispatch ports to an architecture. But knowing the empirical percentage of benchmarks that could benefit from a new feature helps focus on features with a big payoff.

Our method is based on the analysis of assembly code, which is usually produced by a compiler (except for some manually optimized assembly sections in highly specific programs); the compiler's effect must therefore be taken into account. Our approach estimates the impact of new architecture features for a fixed binary, or the impact of different compilers for a fixed target architecture.

The main contributions of this paper are:
1) a profile of low-level characteristics of a large representative body of applications
2) an evaluation of compilers' register allocation and vectorization
3) a static assembly characterization methodology to tune the loop buffer size and the hardware prefetcher

Section II presents the MAQAO static loop analyzer architecture and explains how the assembly features are ex-