CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2014; 26:1328–1341
Published online 14 August 2013 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/cpe.3122

SPECIAL ISSUE PAPER

A hardware counter-based toolkit for the analysis of memory accesses in SMPs

Oscar G. Lorenzo*,†, Tomás F. Pena, José C. Cabaleiro, Juan C. Pichel, Juan A. Lorenzo and Francisco F. Rivera

Centro de Investigación en Tecnoloxías da Información (CITIUS), University of Santiago de Compostela, Santiago de Compostela, Spain

SUMMARY

In this paper, a set of three hardware counter (HC)-based tools to characterise memory access of parallel codes in Symmetric Multiprocessors (SMPs) is presented. This toolkit simplifies accessing and programming HCs, which are included in modern microprocessors. Hardware counters are used to obtain information about memory accesses in a parallel code at very low cost. This information is presented to the user in a friendly way. The first tool can be used to automatically monitor the memory accesses of a system and to analyse a code even if the source is not available. The second tool allows the user to insert in a source code, in a simple and transparent way, the instructions needed to monitor and manage HCs. This way, specific parts of the code can be analysed. The user can either add appropriate directives to a C code or use a graphical interface to select those parts of the code to be analysed. The tool takes this source file and automatically adds the monitoring code. The third tool takes the information gathered by the aforementioned tools, processes it and displays it graphically. This tool shows the information in a comprehensive and simple way, allowing the user to adjust the level of detail. The aim of these tools was to characterise the memory accesses of parallel codes in multicore systems, in which the cache hierarchy can greatly influence the performance.
For illustrative purposes, these tools were used to carry out two case studies, a sparse matrix vector product and a dot product. These studies were carried out in two different environments. However, the tools can be used on almost any system, as long as the necessary HCs are available. Copyright © 2013 John Wiley & Sons, Ltd.

Received 2 November 2012; Revised 12 July 2013; Accepted 23 July 2013

KEY WORDS: hardware counters; parallel codes; monitoring; memory hierarchy; irregular codes

*Correspondence to: Oscar G. Lorenzo, Centro de Investigación en Tecnoloxías da Información (CITIUS), University of Santiago de Compostela, Spain. E-mail: oscar.garcia@usc.es

1. INTRODUCTION

The behaviour of memory accesses is one of the most significant aspects influencing the performance of any code. This fact becomes more and more relevant as the memory wall grows [1]. One area where memory management and utilisation are especially important is that of parallel and distributed systems and, in particular, current multicore architectures. For a parallel code to execute correctly and efficiently, it must be programmed carefully. Taking into account architectural features, particularly the behaviour of memory accesses, is critical to improve locality among accesses and affinity between data and processors. Understanding the performance of a program requires considering several factors, such as the underlying system or the type of workload, which can lead to bottlenecks, or parts of the code where most of the time