Performance Analysis of Non-Uniform Cache Architecture Policies for Chip-Multiprocessors Using the Parsec v2.0 Benchmark Suite

Javier Lira 1, Carlos Molina 2, and Antonio González 3,4

Abstract — Non-Uniform Cache Architectures (NUCA) have been proposed as a solution to overcome the wire delays that will dominate on-chip latencies in Chip Multiprocessor designs in the near future. This novel organization divides the total memory area into a set of banks that provide non-uniform access latencies and thus faster access to those banks that are close to the processor. A NUCA model can be characterized according to the four policies that determine its behavior: bank placement, bank access, bank migration and bank replacement. Placement determines the initial location of data, access defines the search algorithm across the banks, migration decides data movements inside the memory, and replacement deals with evicted data. This paper analyzes the performance of several alternatives for each of these four policies. Moreover, the Parsec v2.0 benchmark suite has been used for this evaluation because it is a representative group of upcoming shared-memory programs for Chip Multiprocessors. The results may help researchers to identify key features of NUCA organizations and to open up new areas of investigation.

I. Introduction

The continuing technological advances in the scale of integration have ensured that the number of transistors that can be integrated into a single chip doubles every two years. This prediction, known as Moore's Law [1], has held for 40 years, and it is widely accepted that this trend will continue over the next 10-15 years. Therefore, future processors will have billions of tiny transistors. Against this background, an important question that arises is how current processors can efficiently use this technology. Chip Multiprocessors (CMPs) have emerged as a dominant paradigm in system design [2], [3].
Several commercial microprocessors already include multiple cores (2 to 8, depending on the model) with a shared cache. Moreover, as the scale of integration increases, chips include more and more cores, which could lead to 64 processor cores being placed on a chip by the middle of the next decade. These multicore systems incorporate larger, shared second-level caches with a homogeneous access time. However, traditional cache architectures assume that each level in the cache hierarchy has a single, uniform access time, whereas the increasing communication delay causes the hit time of large on-chip caches to become a function of a line's physical location within the cache. Consequently, cache access time becomes a continuum of latencies rather than a single discrete latency [4], [5]. Non-Uniform Cache Architecture (NUCA), first proposed by Kim et al. [6], exploits this non-uniformity to provide faster access to cache lines in those portions of the cache that are closer to the processor.

The underlying concept behind a NUCA system involves dividing the whole cache into smaller banks. Each of these banks traditionally has a single discrete latency, although this is much smaller than it would be if the whole cache were a uniform cache. Data are distributed among all the banks, so the total latency for a processor to obtain a single piece of data includes the request and response routing time between the processor and the bank containing the requested data, plus the access latency of the bank itself. A NUCA model can be characterized by the following four policies that determine its behavior: Bank Placement Policy, Bank Access Policy, Bank Migration Policy, and Bank Replacement Policy.

1 Universitat Politècnica de Catalunya, jlira@ac.upc.edu
2 Universitat Rovira i Virgili, carlos.molina@urv.net
3 Intel Barcelona Research Center
4 Intel Labs - UPC, antonio.gonzalez@intel.com
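As an illustration, the NUCA access cost just described (request routing to the bank, the bank's own access latency, and response routing back) can be sketched as a simple cycle count. The hop counts and per-hop/per-bank cycle costs below are illustrative assumptions, not figures from this paper.

```python
def nuca_access_latency(hops_to_bank: int,
                        bank_latency: int = 3,
                        cycles_per_hop: int = 1) -> int:
    """Cycles for a processor to fetch a line from a NUCA bank
    located `hops_to_bank` network hops away.

    All default cycle costs are hypothetical placeholders.
    """
    # Request travels to the bank and the response travels back,
    # so the routing component is paid in both directions.
    routing = 2 * hops_to_bank * cycles_per_hop
    return routing + bank_latency

# Banks close to the processor are cheaper to reach than distant ones:
near = nuca_access_latency(hops_to_bank=1)  # 2*1 + 3 = 5 cycles
far = nuca_access_latency(hops_to_bank=8)   # 2*8 + 3 = 19 cycles
```

This is exactly the non-uniformity the four NUCA policies try to exploit: placement and migration aim to keep frequently used lines at small `hops_to_bank` values for their consumers.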
This paper aims to analyze how NUCA performs on a Chip Multiprocessor using the Parsec v2.0 benchmark suite. Starting from a base configuration, it shows the potential of each of the four policies that characterize the behavior of a NUCA system. Several alternatives for each policy are described, evaluated and discussed.

The remainder of this paper is structured as follows. Section II presents the baseline model that has been assumed, the simulation tools, and a brief description of the benchmarks used. Section III describes the alternatives for each bank policy considered in this study. Section IV presents the results obtained during the simulations. Related work is summarized in Section V. Finally, Section VI outlines the main conclusions of this work.

II. Experimental Framework

A. Baseline Model

This paper deals with an L2 NUCA organization based on that proposed by Beckmann and Wood [7]. Figure 1 illustrates this baseline model. A die with 8 cores on the edges and a shared L2 NUCA cache in the center has been assumed. Each core maintains its own private first-level cache, split into data and instruction caches. First-level caches are 2-way set-associative, while the L2 NUCA cache is 8-way set-associative. The MOESI coherence protocol maintains correctness and robustness in the memory system. Moreover, the length of the wire that connects the NUCA with the third level of the
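The baseline parameters stated above can be collected in one place as a minimal configuration sketch. Only values given in the text are included (8 cores, split 2-way L1s, a shared 8-way L2 NUCA, MOESI coherence); cache and bank sizes are not stated in this excerpt, so they are deliberately omitted rather than guessed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BaselineNUCAConfig:
    """Baseline CMP model from Section II-A (names are illustrative)."""
    cores: int = 8                     # cores placed on the die edges
    l1_split: bool = True              # separate data and instruction L1s
    l1_associativity: int = 2          # L1 caches: 2-way set associative
    l2_associativity: int = 8          # shared L2 NUCA: 8-way set associative
    coherence_protocol: str = "MOESI"  # keeps the memory system correct

cfg = BaselineNUCAConfig()
```

Freezing the dataclass reflects that these are fixed properties of the baseline; the policy alternatives explored in Section III vary behavior on top of this fixed organization.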