Future Generation Computer Systems 24 (2008) 121–132
www.elsevier.com/locate/fgcs

High-level application-specific performance analysis using the G-PM tool

Roland Wismüller a, Marian Bubak b, Włodzimierz Funika c

a BSVS, University of Siegen, Germany
b Institute of Computer Science AGH-UST, Academic Computer Centre — CYFRONET, Kraków, Poland
c Institute of Computer Science AGH-UST, Kraków, Poland

Received 26 September 2006; received in revised form 31 January 2007; accepted 26 March 2007
Available online 6 April 2007

Abstract

The paper presents an approach to overcoming a traditional problem of parallel performance analysis tools: performance data are often too low-level and cannot easily be mapped to the application's code structure, e.g. its execution phases. The G-PM tool offers the user an easy but flexible means to define her/his own high-level, application-specific metrics based on existing metrics and application events. We discuss the basic concepts of G-PM from the user's point of view, its design, and some implementation issues, including the language PMSL, which supports the specification of user-defined metrics. In the main part of the paper, we present a case study based on a real-world medical application from the EU-funded CrossGrid project, which demonstrates the concept of user-defined metrics as well as its usefulness in practice.

© 2007 Elsevier B.V. All rights reserved.

1. Introduction

Most of today's applications that require high computing performance are based on parallel programming using the message passing paradigm, as supported by MPI [17]. For this class of applications, tools that allow us to measure and improve their performance characteristics are vital for the applications' success. Generally, performance analysis tools can be based on three different techniques: tracing, profiling, and online analysis.
With tracing, performance analysis is done in two steps: while the application is executing, relevant events (such as the beginning and the end of a call to the MPI_Send() communication routine) and their time stamps are written to a file. In a subsequent offline step, different performance metrics (e.g. time spent in communication) can be computed from this trace file. Profiling avoids the necessity to store large trace files by computing a predefined set of metrics online, during the application's execution. These metrics typically are summaries over the whole execution. Online analysis can be viewed as a compromise between profiling and tracing: as with profiling, the tool computes performance metrics online, while, as with tracing, the information is still resolved in time. Different from both other approaches, online analysis tools present the performance results while the application is executing and allow the definition of new measurements based on these results.

Today, there is already a number of sophisticated performance tools supporting the analysis of parallel applications. In the report [14], the authors list 26 performance-related tools just in the context of grid computing. However, even with these tools it is still difficult for programmers to optimize their applications based on the provided information. This has two major reasons: First, the information is often too low-level, since it is usually related to communication or even hardware events.

Partially funded by the European Commission (project IST-2001-32243, CrossGrid) and KBN (grant 4 T11C 032 23). Corresponding address: Operating Systems and Distributed Systems (BSVS), University of Siegen, Hölderlinstr. 3, 57068 Siegen, Germany. Tel.: +49 271 740 4050; fax: +49 271 740 4049. E-mail addresses: roland.wismueller@uni-siegen.de (R. Wismüller), bubak@agh.edu.pl (M. Bubak), funika@agh.edu.pl (W. Funika).
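The two-step tracing approach described above can be illustrated with a minimal sketch. The trace format below (one record per line: rank, event name, enter/exit phase, timestamp) is a hypothetical simplification for illustration only, not the format of any particular tool; the offline step then accumulates, per process, the time spent inside communication routines:

```python
# Minimal sketch of the offline step of a trace-based tool.
# The trace format (rank, event, enter/exit, timestamp) is hypothetical.
from collections import defaultdict

def time_in_communication(trace_lines):
    """Sum, per rank, the time between 'enter' and 'exit' of MPI calls."""
    comm_time = defaultdict(float)  # rank -> accumulated seconds
    enter_ts = {}                   # rank -> timestamp of the pending 'enter'
    for line in trace_lines:
        rank, event, phase, ts = line.split()
        if phase == "enter":
            enter_ts[rank] = float(ts)
        elif phase == "exit":
            comm_time[rank] += float(ts) - enter_ts.pop(rank)
    return dict(comm_time)

trace = [
    "0 MPI_Send enter 1.0",
    "0 MPI_Send exit  1.5",
    "1 MPI_Recv enter 0.75",
    "1 MPI_Recv exit  1.5",
]
print(time_in_communication(trace))  # → {'0': 0.5, '1': 0.75}
```

A profiling tool would compute the same sums online and discard the individual events, which is exactly why the time-resolved information is lost in that approach.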
For example, tools for MPI typically provide the time spent in MPI_Barrier() or MPI_Recv(), but they fail to provide information about load imbalance. This is because, in general, the way of measuring the metric "load imbalance" is application specific. While in shared memory applications load imbalance can usually be measured by comparing the waiting times at a barrier in the individual threads, message passing applications can also synchronize via messages. In

0167-739X/$ – see front matter © 2007 Elsevier B.V. All rights reserved.
doi:10.1016/j.future.2007.03.008
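The shared-memory case mentioned above, where load imbalance is derived from the threads' waiting times at a barrier, can be sketched as follows. The helper is hypothetical (not part of G-PM or any MPI tool): since the barrier releases when the last thread arrives, each thread's wait is the gap between its own arrival and the latest arrival, and the maximum wait serves as a simple imbalance indicator.

```python
# Hypothetical sketch (not G-PM code): deriving a load-imbalance
# indicator from per-thread arrival times at a barrier.
def barrier_imbalance(arrival_times):
    """Return each thread's barrier waiting time and the maximum wait,
    a simple scalar indicator of load imbalance."""
    release = max(arrival_times)  # barrier opens when the last thread arrives
    waits = [release - t for t in arrival_times]
    return waits, max(waits)

# Threads arrive at t = 2.0, 3.5 and 4.0 s: thread 0 idles for 2.0 s.
waits, imbalance = barrier_imbalance([2.0, 3.5, 4.0])
print(waits, imbalance)  # → [2.0, 0.5, 0.0] 2.0
```

In a message passing program, by contrast, the corresponding waiting may occur inside receive operations rather than at an explicit barrier, which is why a generic tool cannot compute such a metric without application-specific knowledge.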