Future Generation Computer Systems 24 (2008) 121–132
www.elsevier.com/locate/fgcs

High-level application-specific performance analysis using the G-PM tool

Roland Wismüller a, Marian Bubak b, Włodzimierz Funika c

a BSVS, University of Siegen, Germany
b Institute of Computer Science AGH-UST, Academic Computer Centre — CYFRONET, Kraków, Poland
c Institute of Computer Science AGH-UST, Kraków, Poland

Received 26 September 2006; received in revised form 31 January 2007; accepted 26 March 2007
Available online 6 April 2007

Abstract

The paper presents an approach to overcoming a traditional problem of parallel performance analysis tools: performance data are often too low-level and cannot easily be mapped to the application's code structure, e.g. its execution phases. The G-PM tool offers the user an easy but flexible means to define her/his own high-level, application-specific metrics based on existing metrics and application events. We discuss the basic concepts of G-PM from the user's point of view, its design, and some implementation issues, including the language PMSL, which supports the specification of user-defined metrics. In the main part of the paper, we present a case study based on a real-world medical application from the EU-funded CrossGrid project, which demonstrates the concept of user-defined metrics as well as its usefulness in practice.

© 2007 Elsevier B.V. All rights reserved.

1. Introduction

Most of today's applications that require high computing performance are based on parallel programming using the message passing paradigm, as supported by MPI [17]. For this class of applications, tools that allow us to measure and improve their performance characteristics are vital for the applications' success. Generally, performance analysis tools can be based on three different techniques: tracing, profiling, and online analysis.
With tracing, performance analysis is done in two steps: while the application is executing, relevant events (such as the beginning and the end of a call to the MPI_Send() communication routine) and their time stamps are written to a file. In a subsequent offline step, different performance metrics (e.g. time spent in communication) can be computed from this trace file. Profiling avoids the necessity to store large trace files by computing a predefined set of metrics online, during the application's execution. These metrics typically are summaries over the whole execution. Online analysis can be viewed as a compromise between profiling and tracing: as with profiling, the tool computes performance metrics online, while, as with tracing, the information is still resolved in time. Different from both other approaches, online analysis tools present the performance results while the application is executing and allow the definition of new measurements based on these results.

Today, there is already a number of sophisticated performance tools supporting the analysis of parallel applications. In the report [14], the authors list 26 performance-related tools just in the context of grid computing. However, even with these tools it is still difficult for programmers to optimize their applications based on the provided information. This has two major reasons: First, the information is often too low-level, since it is usually related to communication or even hardware events.

Partially funded by the European Commission (project IST-2001-32243, CrossGrid) and KBN (grant 4 T11C 032 23). Corresponding address: Operating Systems and Distributed Systems (BSVS), University of Siegen, Hölderlinstr. 3, 57068 Siegen, Germany. Tel.: +49 271 740 4050; fax: +49 271 740 4049. E-mail addresses: roland.wismueller@uni-siegen.de (R. Wismüller), bubak@agh.edu.pl (M. Bubak), funika@agh.edu.pl (W. Funika).
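The two-step tracing approach described above can be illustrated with a minimal sketch. The trace format below (one record per line: rank, event name, enter/exit phase, timestamp) is a hypothetical simplification for illustration only, not the format of any particular tool; the offline step then accumulates, per process, the time spent inside communication routines:

```python
# Minimal sketch of the offline step of a trace-based tool.
# The trace format (rank, event, enter/exit, timestamp) is hypothetical.
from collections import defaultdict

def time_in_communication(trace_lines):
    """Sum, per rank, the time between 'enter' and 'exit' of MPI calls."""
    comm_time = defaultdict(float)  # rank -> accumulated seconds
    enter_ts = {}                   # rank -> timestamp of the pending 'enter'
    for line in trace_lines:
        rank, event, phase, ts = line.split()
        if phase == "enter":
            enter_ts[rank] = float(ts)
        elif phase == "exit":
            comm_time[rank] += float(ts) - enter_ts.pop(rank)
    return dict(comm_time)

trace = [
    "0 MPI_Send enter 1.0",
    "0 MPI_Send exit  1.5",
    "1 MPI_Recv enter 0.75",
    "1 MPI_Recv exit  1.5",
]
print(time_in_communication(trace))  # → {'0': 0.5, '1': 0.75}
```

A profiling tool would compute the same sums online and discard the individual events, which is exactly why the time-resolved information is lost in that approach.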
For example, tools for MPI typically provide the time spent in MPI_Barrier() or MPI_Recv(), but they fail to provide information about load imbalance. This is because, in general, the way of measuring the metric "load imbalance" is application specific. While in shared memory applications load imbalance can usually be measured by comparing the waiting times at a barrier in the individual threads, message passing applications can also synchronize via messages. In

0167-739X/$ – see front matter © 2007 Elsevier B.V. All rights reserved.
doi:10.1016/j.future.2007.03.008
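The shared-memory case mentioned above, where load imbalance is derived from the threads' waiting times at a barrier, can be sketched as follows. The helper is hypothetical (not part of G-PM or any MPI tool): since the barrier releases when the last thread arrives, each thread's wait is the gap between its own arrival and the latest arrival, and the maximum wait serves as a simple imbalance indicator.

```python
# Hypothetical sketch (not G-PM code): deriving a load-imbalance
# indicator from per-thread arrival times at a barrier.
def barrier_imbalance(arrival_times):
    """Return each thread's barrier waiting time and the maximum wait,
    a simple scalar indicator of load imbalance."""
    release = max(arrival_times)  # barrier opens when the last thread arrives
    waits = [release - t for t in arrival_times]
    return waits, max(waits)

# Threads arrive at t = 2.0, 3.5 and 4.0 s: thread 0 idles for 2.0 s.
waits, imbalance = barrier_imbalance([2.0, 3.5, 4.0])
print(waits, imbalance)  # → [2.0, 0.5, 0.0] 2.0
```

In a message passing program, by contrast, the corresponding waiting may occur inside receive operations rather than at an explicit barrier, which is why a generic tool cannot compute such a metric without application-specific knowledge.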