Online Performance Analysis by Statistical Sampling of Microprocessor Performance Counters

Reza Azimi, Department of Electrical and Computer Engineering, University of Toronto, Toronto, Ontario, Canada (azimi@eecg.toronto.edu)
Michael Stumm, Department of Electrical and Computer Engineering, University of Toronto, Toronto, Ontario, Canada (stumm@eecg.toronto.edu)
Robert W. Wisniewski, IBM T. J. Watson Research Lab, Yorktown Heights, New York, USA (bobww@us.ibm.com)

Abstract

Hardware performance counters (HPCs) are increasingly being used to analyze performance and identify the causes of performance bottlenecks. However, HPCs are difficult to use for several reasons. Microprocessors do not provide enough counters to simultaneously monitor the many different types of events needed to form an overall understanding of performance. Moreover, HPCs primarily count low-level micro-architectural events from which it is difficult to extract the high-level insight required for identifying the causes of performance problems. We describe two techniques that help overcome these difficulties, allowing HPCs to be used in dynamic real-time optimizers. First, statistical sampling is used to dynamically multiplex HPCs and make a larger set of logical HPCs available. Using real programs, we show experimentally that it is possible through this sampling to obtain counts of hardware events that are statistically similar (within 15%) to complete non-sampled counts, thus allowing us to provide a much larger set of logical HPCs. Second, we observe that stall cycles are a primary source of inefficiency, and hence they should be major targets for software optimization. Based on this observation, we build a simple model in real time that speculatively associates each stall cycle with the processor component that likely caused the stall. The information needed to produce this model is obtained using our HPC multiplexing approach to monitor a large number of hardware components simultaneously.
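The first technique summarized above, time-multiplexing a few physical HPCs into a larger set of logical counters and extrapolating full counts from the sampled intervals, can be sketched as follows. This is an illustrative simulation, not the paper's implementation: the event names, rates, noise level, and two-counter budget are all hypothetical, and a real system would reprogram the PMU in the kernel on timer interrupts rather than in a user-level loop.

```python
import random

random.seed(0)  # deterministic for this illustration

# Hypothetical setup: 2 physical HPCs time-shared among 6 logical events.
N_PHYSICAL = 2
EVENTS = ["cycles", "instrs", "l1_miss", "l2_miss", "tlb_miss", "br_mispred"]
# Synthetic per-interval event rates standing in for real hardware activity.
TRUE_RATES = {e: random.uniform(10, 1000) for e in EVENTS}

# Round-robin schedule: partition the events into groups that fit the PMU.
groups = [EVENTS[i:i + N_PHYSICAL] for i in range(0, len(EVENTS), N_PHYSICAL)]
INTERVALS = 3000
observed = {e: 0.0 for e in EVENTS}   # counts seen while scheduled
scheduled = {e: 0 for e in EVENTS}    # intervals each event was counted

for t in range(INTERVALS):
    group = groups[t % len(groups)]   # one group owns the PMU per interval
    for e in group:
        # What a physical counter would accumulate this interval (with noise).
        observed[e] += random.gauss(TRUE_RATES[e], TRUE_RATES[e] * 0.05)
        scheduled[e] += 1

# Statistical extrapolation: scale each count by the inverse of the
# fraction of time its event was actually being counted.
for e in EVENTS:
    estimate = observed[e] * INTERVALS / scheduled[e]
    truth = TRUE_RATES[e] * INTERVALS
    err = abs(estimate - truth) / truth
    print(f"{e:12s} est={estimate:12.0f} true={truth:12.0f} err={err:.2%}")
```

With each event counted in a third of the intervals, the extrapolated totals track the (synthetic) true counts closely, which is the intuition behind the paper's observation that sampled counts can stay statistically close to complete counts.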
Our analysis shows that even in an out-of-order superscalar microprocessor, such a speculative approach yields a fairly accurate model, with a run-time overhead for collection and computation of under 2%. These results demonstrate that we can effectively analyze the online performance of application and system code running at full speed. The stall analysis shows where performance is being lost on a given processor.

ICS '05, June 20-22, Boston, MA, USA. Copyright 2005, ACM 1-59593-167-8/06/2005...$5.00

1 Introduction

Hardware Performance Counters (HPCs) are an integral part of the Performance Monitoring Units (PMUs) of modern microprocessors. They can be used to monitor and analyze performance in real time. HPCs allow counting of detailed micro-architectural events in the processor [15, 24, 13, 2], enabling new ways to monitor and analyze performance. There has been considerable work that has used HPCs to explore the behavior of applications and identify performance bottlenecks resulting from excessively stressed micro-architecture components [1, 8, 25]. However, exploiting HPCs at run time for dynamic optimization purposes has proven to be challenging for a number of reasons:

Limited Hardware Resources: PMUs typically have a small number of HPCs (e.g., up to 8 in IBM PowerPC processors, 4 in the Intel Itanium II, and 4 in AMD Athlon processors). As a result, only a limited number of low-level hardware events can be monitored at any given time.
Moreover, due to hardware-level programming constraints, only specific subsets of hardware events can be programmed to be counted together. This is a serious limitation considering that detecting performance bottlenecks in complex superscalar microprocessors often requires detailed and extensive performance knowledge of several processor components. One way to get around this limitation is to execute several runs of an application, each time with a different set of events being captured. Such an approach can become time-consuming for offline performance analysis, and is completely inappropriate for online analysis. Merging the traces generated from several application runs for offline analysis is also not straightforward, because asynchronous events (e.g., interrupts and I/O events) in each run may cause significant timing drifts.

Complex Interface: The events that can be monitored by HPCs are often low-level and specific to a micro-architecture implementation, and as a result they are hard to interpret correctly without detailed knowledge of that implementation. In fact, in the processors we have considered, most high-level performance metrics, such as Cycles Per Instruction (CPI), cache miss ratio, and memory bus contention, can only be measured by carefully combining the occurrence frequencies of several hardware events. At best, this makes HPCs hard for average application developers to use; but even for seasoned systems programmers, the complexity of today's micro-architectures makes it challenging to translate the frequency of particular hardware-level events into their actual impact on end performance.

High Overhead: Because PMU resources are shared among all system processes, they can only be programmed in supervisor