Predicting Performance on SMPs. A Case Study: The SGI Power Challenge

Nancy M. Amato, Jack Perdue, Andrea Pietracaprina, Geppino Pucci, Mark Mathis

Department of Computer Science, Texas A&M University, College Station, TX, USA. {amato,jkp2866,mmathis}@cs.tamu.edu
Dipartimento di Elettronica e Informatica, Università di Padova, Italy. {andrea,geppo}@artemide.dei.unipd.it

This research was supported in part by NATO CRG 961243 “Bulk Synchronous Computational Geometry,” and by NCSA grant CCR970010N. The work at Texas A&M was also supported by the NSF CAREER award CCR-9624315 and grants IRI-9619850, ACI-9872126, EIA-9805823, EIA-9810937, by DOE ASCI ASAP (Level 2 Program) grant B347886, and by the Texas Higher Education Coordinating Board grant ARP-036327-017. Perdue and Mathis were supported in part by Dept. of Education Graduate Fellowships. The work at Padova was also supported by MURST of Italy under project “Algorithms for Large Data Sets: Science and Engineering.”

Abstract

We study the issue of performance prediction on the SGI-Power Challenge, a typical SMP. On such a platform, the cost of memory accesses depends on their locality and on contention among processors. By running a carefully designed suite of microbenchmarks, we provide quantitative evidence that memory hierarchy effects impact performance far more substantially than other phenomena related to contention. We also fit three cost functions based on variants of the BSP model, which do not account for the hierarchy, and a newly defined function F, expressed in terms of hardware counters, which captures both memory hierarchy and contention effects. We test the accuracy of all the functions on both synthetic and application benchmarks, showing that, unlike the other functions, F achieves an excellent level of accuracy in all cases. Although hardware counters are only available at run time, we give evidence that function F can still be employed as a prediction tool by extrapolating values of the counters from pilot runs on small input sizes.

1 Introduction

Despite the vast body of ingenious parallel algorithmic techniques developed over the last two decades, the widespread use of parallel computers is still hampered by the difficulty of exploiting their massive computational potential to an extent that warrants their large cost. Indeed, it has often been noted that theoretically efficient algorithms exhibit poor performance when implemented on real machines. Very often, this is due to the inadequacies of the cost functions employed to predict performance, which do not properly account for, or totally disregard, aspects of the machine that have a major impact on performance.

Although much progress has been made, the development of adequate tools for predicting actual performance on real machines remains one of the most challenging problems in parallel processing. We believe that further progress towards this goal requires a tighter coupling of cost models to architectures than has been previously employed.

The issue of predictivity is especially challenging for the class of Symmetric MultiProcessors (SMPs). These widespread parallel platforms are built upon powerful off-the-shelf microprocessors interacting through a distributed shared-memory via a communication medium, typically a bus. In such a system, the cost of an access to a shared datum may vary dramatically: from a few cycles if the data is in first-level (L1) cache, to tens of cycles for second-level (L2) cache, to hundreds of cycles if the data must be accessed from main memory. The cost may be even greater in the presence of high contention among the processors for the bus or memory banks, or of (false) data sharing.
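To make the spread in access costs concrete, the following minimal sketch computes the expected cost per access for a two-level cache hierarchy as a weighted sum over the three levels named above. The latencies (4, 40, and 400 cycles) are illustrative placeholders chosen only to match the orders of magnitude quoted in the text; they are not measurements of the SGI-PC, the function name `average_access_cost` is hypothetical, and contention is deliberately left out.

```python
# Illustrative per-access latencies in cycles. These are ASSUMED values that
# only mirror "a few", "tens", and "hundreds" of cycles; they are not measured.
LATENCY = {"L1": 4.0, "L2": 40.0, "MEM": 400.0}


def average_access_cost(l1_hit_rate, l2_hit_rate, latency=LATENCY):
    """Expected cycles per access for a two-level cache hierarchy.

    l1_hit_rate: fraction of all accesses served by the L1 cache.
    l2_hit_rate: fraction of L1 misses served by the L2 cache.
    Contention and false sharing are ignored in this sketch.
    """
    if not (0.0 <= l1_hit_rate <= 1.0 and 0.0 <= l2_hit_rate <= 1.0):
        raise ValueError("hit rates must lie in [0, 1]")
    l1_miss = 1.0 - l1_hit_rate
    l2_miss = 1.0 - l2_hit_rate
    return (l1_hit_rate * latency["L1"]
            + l1_miss * l2_hit_rate * latency["L2"]
            + l1_miss * l2_miss * latency["MEM"])


if __name__ == "__main__":
    # Sweep the L1 hit rate with half of the L1 misses caught by L2.
    for l1 in (1.0, 0.9, 0.5, 0.0):
        print(f"L1 hit rate {l1:.1f}: {average_access_cost(l1, 0.5):.1f} cycles")
```

Under these assumed latencies, even a 10% L1 miss rate raises the average cost several-fold over the all-L1 case, which suggests why a cost function that ignores locality can mispredict badly.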

Our Contribution. In this paper, we study the relative impact on performance of hierarchy and contention phenomena on an SGI-Power Challenge (SGI-PC), which is a typical representative of the class of SMPs. More specifically, we present a suite of synthetic microbenchmarks which exercise different usages of the hierarchy under a set of controlled scenarios obtained by varying the level and type of contention among the processors. Based on the access times measured through the microbenchmarks, we infer parameter values for a set of linear cost functions inspired by some variants of the popular Bulk-Synchronous Parallel (BSP) model [14], and of a newly defined function which relies on the MIPS R10000 hardware counters describing the memory hierarchy usage of a program. While the BSP-derived