1556-6056 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/LCA.2014.2321398, IEEE Computer Architecture Letters

Profiling Support for Runtime Managed Code: Next Generation Performance Monitoring Units

Enric Gibert, Raúl Martínez, Carlos Madriles, Josep M. Codina
Intel Barcelona Research Center (IBRC), Intel Labs, Intel Corp.
email: {enric.gibert.codina, raul.martinez, carlos.madriles.gimeno, josep.m.codina}@intel.com

Abstract—Given the increase of runtime managed code environments in desktop, server, and mobile segments, agile, flexible, and accurate performance monitoring capabilities are required in order to perform wise code transformations and optimizations. Common profiling strategies, mainly based on instrumentation and current Performance Monitoring Units (PMUs), are not adequate, and new innovative designs are necessary. In this paper, we present the desired characteristics of what we call Next Generation PMUs and advocate for hardware/software collaborative approaches where hardware implements the profiling hooks and mechanisms and software implements the complex heuristics. We then propose a first design in which the hardware uses a small, yet flexible table to profile specific code regions and the software decides what/when/how to profile. This first design meets all required features, and we intend it as the seed for future PMU extensions that enable novel dynamic code transformations and optimizations.
Index Terms—Performance Monitoring Unit (PMU), profiling, runtime managed code, Just in Time (JIT) compiler

1 INTRODUCTION

Application profiling is the action of inferring execution characteristics of a program while it is being executed. The main profiling goal has been, at least in industry, to generate new application binaries offline based on a previous execution profile in order to improve their performance and/or energy consumption. With the massive introduction of runtime managed environments such as the Java Runtime Environment (JRE)* or the Common Language Runtime (CLR)*, and powerful scripting languages such as PHP, Python, or JavaScript, this profiling paradigm needs to become more dynamic. In these managed environments, applications, their modules, dynamic libraries, and scripts are loaded, compiled, linked, and optimized at runtime. Thus, profiling information needs to be collected quickly to allow the runtime to react and generate efficient code on-the-fly. In addition, with the introduction of speculative execution support in mainstream processors [10][8], dynamic speculative optimizations based on profiling feedback are seen as one of the next big research topics in compiler technology.

Common profiling strategies based on code instrumentation and/or current Performance Monitoring Units (PMUs) do not meet the aforementioned requirements and are not adequate for dynamic (and sometimes speculative) code transformations and optimizations. In this paper, we advocate for novel profiling techniques with a strong symbiosis between hardware and software. In particular, we identify the requirements for what we call Next Generation PMUs and present a first example of a profiling paradigm in this new research direction.

2 CURRENT PROFILING SCHEMES

Program instrumentation has often been used to detect hot basic blocks or dynamic call traces.
However, instrumentation has two drawbacks that limit its applicability for future code optimizations in runtime managed environments. First, extra instructions are executed to obtain dynamic information, which leads to non-negligible performance and power overheads (basic instrumentation normally adds around 4-5 instructions to each basic block to compute basic block weights or edge profiles). These overheads remain significant even in schemes in which instrumentation is applied selectively [4] or in which more complex algorithms such as path profiling [5] are used to reduce the number of additional instructions (note that for the latter, the algorithm is part of the runtime). Second, instrumentation is limited to architectural events such as branch outcomes because it cannot capture microarchitectural events like cache misses and branch mispredictions. Most runtime managed environments today use instrumentation to extract particular characteristics of a program [6][7][12][16][18], including method invocations, basic block counts, loop trip counts, and data types, among others.

On the other hand, although current Performance Monitoring Units (PMUs) [11][9][3] can capture architectural and microarchitectural information, their modus operandi is based on sampling. In particular, the user programs the unit to generate an information record (normally stored in memory) after a specific event has occurred (e.g., a cache miss) a certain number of times (normally on the order of thousands to reduce overheads). This record contains information about the instruction that generated the event, such as its Program Counter and its latency. Obviously, collecting meaningful information based on sampling is a slow process, and it is not adequate for dynamic optimizations, especially in the desktop and mobile segments where short applications are common.
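The instrumentation overhead discussed above can be illustrated with a minimal sketch (a toy illustration, not the instrumentation scheme of any cited system): an instrumenting compiler inserts a counter update into every basic block, and those few extra instructions per block are precisely the performance and power cost the text describes. The function and counter names here are hypothetical.

```python
from collections import Counter

# Hypothetical per-basic-block counters that an instrumenting compiler
# would insert. Each "bb_count[...] += 1" stands in for the handful of
# extra machine instructions (load, add, store) added to every block.
bb_count = Counter()

def is_even(x):
    bb_count["entry"] += 1       # instrumentation in the entry block
    if x % 2 == 0:
        bb_count["then"] += 1    # instrumentation on the taken edge
        return True
    bb_count["else"] += 1        # instrumentation on the fall-through edge
    return False

evens = sum(is_even(i) for i in range(10))
# bb_count now holds exact basic-block weights; the branch's edge
# profile is then/entry taken vs. else/entry not-taken. Note that only
# architectural facts (which blocks ran) are visible this way: nothing
# here can observe cache misses or branch mispredictions.
```

After the loop, `bb_count` records 10 entries with a 5/5 split between the two branch edges, giving an exact (not sampled) profile at the cost of one extra counter update per block executed.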
For instance, knowing the data cache miss ratio for a given memory instruction at a granularity of 1/100 implies at least 100 profiling records in memory, which are often

* Other names and brands may be claimed as the property of others.
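The arithmetic behind the granularity example above can be made explicit with a back-of-the-envelope sketch. The helper name and the sample-after value of 1000 are assumptions for illustration (the text only says the sample-after value is normally on the order of thousands):

```python
def sampling_cost(granularity_denom, sample_after):
    """Minimum profiling records and raw hardware events needed to
    estimate an event ratio for one instruction at granularity
    1/granularity_denom, when the PMU emits one record every
    sample_after occurrences of the event."""
    records = granularity_denom           # at least this many records
    raw_events = records * sample_after   # events seen before converging
    return records, raw_events

# Granularity 1/100 with an assumed sample-after value of 1000:
records, events = sampling_cost(100, 1000)
# The estimate needs 100 records in memory, which only exist after the
# instruction has missed the cache 100,000 times -- far too slow for
# short-running desktop and mobile applications.
```

The multiplicative gap between records and raw events is why sampling-based PMUs converge too slowly to drive dynamic optimizations on short-running code.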