Trace-Based Parallel Performance Overhead Compensation

Felix Wolf 1, Allen D. Malony 2, Sameer Shende 2, and Alan Morris 2

1 Innovative Computing Laboratory, University of Tennessee
fwolf@cs.utk.edu
2 Department of Computer and Information Science, University of Oregon
{malony, morris, sameer}@cs.uoregon.edu

Abstract. Tracing parallel programs to observe their performance introduces intrusion as the result of trace measurement overhead. If post-mortem trace analysis does not compensate for the overhead, the intrusion will lead to errors in the performance results. We show that measurement overhead can be accounted for during trace analysis and intrusion modeled and removed. Algorithms developed in our earlier work [5] are reimplemented in a more robust and modern tool, KOJAK [12], allowing them to be applied in large-scale parallel programs. The ability to reduce trace measurement error is demonstrated for a Monte-Carlo simulation based on a master/worker scheme. As an additional result, we visualize how local perturbation propagates across process boundaries and alters the behavioral characteristics of non-local processes.

Keywords: Performance measurement, analysis, parallel computing, tracing, message passing, overhead compensation.

1 Introduction

Trace-based measurement is used to observe the performance of a parallel program when one wants to see the interoperation of multiple threads or processes of execution, as it is recorded in a time-sequence trace of events. Any performance measurement, tracing included, will introduce overhead during program execution due to extra code being executed and hardware resources (processor, memory, network) consumed. When performance overhead affects the program execution, we speak of performance (measurement) intrusion. Performance intrusion, no matter how small, can result in performance perturbation [6] where the program’s measured performance behavior is “different” from its unmeasured performance.
Whereas performance perturbation is difficult to assess, performance intrusion can be quantified by several metrics, the most important of which is dilation in program execution time. This type of intrusion is often reported as a percentage slowdown of total execution time, but the intrusion effects themselves will be distributed throughout the performance results. In the case of tracing, we will also see performance error due to intrusion (i.e., performance perturbation) in the timings of the interdependent events between the processes.

L.T. Yang et al. (Eds.): HPCC 2005, LNCS 3726, pp. 617–628, 2005. © Springer-Verlag Berlin Heidelberg 2005

Of course, we cannot compare the measured parallel execution with the “real” parallel execution to determine the intrusion error because we do not have any information