Scal-Tool: Pinpointing and Quantifying Scalability Bottlenecks in DSM Multiprocessors

Yan Solihin, Vinh Lam, and Josep Torrellas
Department of Computer Science
University of Illinois at Urbana-Champaign, IL 61801
solihin,lam,torrellas@cs.uiuc.edu
http://iacoma.cs.uiuc.edu

Abstract

Distributed Shared-Memory (DSM) multiprocessors provide an attractive combination of cost-effective commodity architecture and, thanks to the shared-memory abstraction, relative ease of programming. Unfortunately, it is well known that tuning applications for scalable performance in these machines is time-consuming. To address this problem, programmers use performance monitoring tools. However, these tools are often costly to run, especially if highly-processed information is desired. In addition, they usually cannot be used to experiment with hypothetical architecture organizations.

In this paper, we present Scal-Tool, a tool that isolates and quantifies scalability bottlenecks in parallel applications running on DSM machines. The scalability bottlenecks currently quantified include insufficient caching space, load imbalance, and synchronization. The tool is based on an empirical model that uses as inputs measurements from hardware event counters in the processor. A major advantage of the tool is that it is quite inexpensive to run: it only needs the event counter values for the application running with a few different processor counts and data set sizes. In addition, it provides ways to analyze variations of several machine parameters.

1 Introduction

Distributed Shared-Memory (DSM) multiprocessors are taking an increasing share of the market for medium- and large-scale parallel processing. These machines, built out of fast off-the-shelf microprocessors and sometimes even networks and nodes, exploit cost-effective commodity architectural designs.
Examples of such machines are the SGI Origin 2000 [10], the HP-Convex Exemplar [1], the Sequent NUMA-Q [11], the Data General NUMALiiNE [3], and the Sun WildFire [7].

In addition, it is argued that these machines deliver a programmable environment. The reason is the simplicity of the shared-memory abstraction presented to the programmer and the existence of automatic parallelizing compilers that, at least in theory, relieve the programmer from having to explicitly parallelize the codes.

[Footnote: This work was supported in part by the National Science Foundation under grants NCSA-PACI ACI 96-19019COOP, NSF Young Investigator Award MIP-9457436, ASC-9612099, and MIP-9619351, gifts from Intel and IBM, and NCSA machine time under grant AST910367N.]

In practice, it is well known that tuning applications for scalable performance in these machines is often a time-consuming effort. Indeed, the shared-memory abstraction hides widely-different costs, depending on whether the memory location accessed is cached or, if not, exists in local or in remote memory. Furthermore, state-of-the-art automatic parallelizing compilers still have considerable limitations.

As a result, computer manufacturers have developed a set of performance monitoring tools to assist the programmer in performance-tuning the applications. For example, SGI provides, among other tools: perfex [18], which is a set of hardware event counters in the processor that can record up to 32 hardware events; speedshop [19], which profiles the cycles spent in each routine of the application or library; and ssusage [20], which measures the maximum number of pages that an application has at any time in memory. With these tools, the programmer can find bottlenecks in the code and tune it for better scalability.

Unfortunately, these tools are often costly to run, especially if we want to obtain highly-processed information.
For example, suppose that we want to measure the total time that a parallel program spends synchronizing or spinning due to load imbalance, for different processor counts (1, 2, 4, ...). For each processor count, we run time and the more intrusive speedshop. We use time to measure the execution time, while we use speedshop to measure the fraction of the cycles spent in the synchronization and spinning routines. Table 1 lists the number of runs that must be performed, the total number of processors required, and the number of output files that must be analyzed. Although the number of files could be reduced by generating a single file in every speedshop run, all these activities clearly involve significant effort and resource usage.

In addition, most performance monitoring tools can only measure statistics on the actual machine. As a result, they can rarely be used to experiment with hypothetical architecture organizations. For example, it is usually hard to estimate the effect of doubling the L2 cache size on application performance.
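The cost of such a measurement campaign can be sketched with a few lines of arithmetic. The following Python fragment is only illustrative: the doubling schedule, the two runs (time plus speedshop) per processor count, and the assumption that speedshop emits one output file per process are ours, not figures taken from Table 1.

```python
# Rough bookkeeping for the measurement campaign described above:
# for each processor count we perform one 'time' run and one
# 'speedshop' run. The one-file-per-process figure for speedshop
# output is an assumption for illustration only.

def campaign_cost(max_procs=32):
    counts = []
    p = 1
    while p <= max_procs:      # processor counts 1, 2, 4, ..., max_procs
        counts.append(p)
        p *= 2
    runs = 2 * len(counts)     # one 'time' + one 'speedshop' run per count
    procs = 2 * sum(counts)    # processor slots occupied across all runs
    files = sum(counts)        # assumed: one speedshop file per process
    return runs, procs, files

print(campaign_cost(32))
```

For processor counts up to 32 this already means a dozen runs and dozens of output files to collect and analyze, which is the effort the table quantifies.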