Cache Miss Characterization in Hierarchical Large-Scale Cache-Coherent Systems

Alberto Ros∗, Blas Cuesta†§, María E. Gómez‡, Antonio Robles‡, José Duato‡

∗Departamento de Ingeniería y Tecnología de Computadores, Universidad de Murcia, 30100 Murcia (Spain). E-mail: aros@ditec.um.es
†Intel Labs Barcelona. E-mail: blasx.cuesta@intel.com, blacuesa@gap.upv.es
‡Department of Computer Engineering, Universitat Politècnica de València, 46021 Valencia (Spain). E-mail: {megomez,arobles,jduato}@gap.upv.es

Abstract—There is a growing trend towards developing large-scale cache-coherent systems by using commodity symmetric multiprocessors, which requires extending their coherence protocol. In such systems, cache coherence transactions issued due to cache misses traverse interconnection networks with very different topologies and latencies. In this work, we perform a cache miss characterization aimed at analyzing the benefits that can be expected from a specialized coherence controller able to resolve cache misses locally, thus saving traffic across long-latency links. Results show that there is a high potential for reducing miss latency in these systems, and that this potential reduction grows as the number of nodes in the system increases. In particular, in a system with just two boards, 40% of the cache misses do not need the expensive inter-board communication. This percentage can rise to 67.5% for an 8-board system.

I. INTRODUCTION

Until recently, many service providers were able to use clusters of PCs for high performance computing (HPC). This kind of cluster usually relies on message-passing communication for remote memory accesses, which not only increases communication latency but also makes it harder to develop efficient applications than under the shared-memory programming model. These drawbacks highlight the need for large-scale cache-coherent systems.
There is a current trend towards building such large-scale cache-coherent systems from existing commodity symmetric multiprocessors (SMPs), which requires extending their coherence protocol. AMD was the first to include such features in its Opteron processors. In particular, the six- and twelve-core versions of AMD Opteron processors, codenamed Istanbul and Magny-Cours [1] respectively, can be interconnected to form a larger system while still maintaining cache coherence thanks to the Coherent HyperTransport (cHT) technology [2]. Similarly, Intel's QuickPath Interconnect (QPI) allows several Nehalem processors to form a larger coherent system. To further increase the number of processor cores that can be kept coherent in such systems, several proposals aimed at extending the coherence domain have appeared recently. Examples of these systems can be found either on the market (e.g., Horus [3] and SGI Altix UV [4]) or in the literature (e.g., EMC² [5], [6]).

§ This work was done before the author joined Intel, while at the Universitat Politècnica de València.

These hierarchical systems have very different communication latencies among processing cores depending on the distance, the interconnection technology, and the level in the coherence hierarchy, as shown in Figure 1. The basic building block is the die, which can comprise several processor cores (currently from 4 to 12). Communication among these cores is very fast (just a few nanoseconds) and can be carried out over a shared bus. Several dies can be placed on the same board to form a larger system. Communication among dies is commonly performed through a scalable point-to-point interconnect (e.g., cHT or QPI) and usually requires tens of nanoseconds [1]. Finally, several boards can be connected by an InfiniBand [7] or Ethernet switch fabric.
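To make the latency gap between hierarchy levels concrete, the following Python sketch models the three levels (die, board, system) and gives a back-of-envelope estimate of mean miss latency as a function of the fraction of misses that stay on the board. It is only an illustration, not the methodology of this work. The names `Core`, `link_level`, and `mean_miss_latency_ns` are hypothetical, and the latency constants are assumptions chosen to match the quoted orders of magnitude (a few nanoseconds, tens of nanoseconds, and roughly one microsecond for an inter-board hop). The model also assumes, simplistically, that every locally resolved miss costs one inter-die hop.

```python
from dataclasses import dataclass

# Assumed, illustrative latencies in nanoseconds (orders of magnitude only,
# not measurements): intra-die, inter-die on one board, and inter-board.
ASSUMED_LATENCY_NS = {
    "intra_die": 5.0,
    "inter_die": 50.0,
    "inter_board": 1000.0,
}


@dataclass(frozen=True)
class Core:
    """Position of a core in the hierarchy (hypothetical naming)."""
    board: int
    die: int
    core: int


def link_level(src: Core, dst: Core) -> str:
    """Return the highest hierarchy level a message from src to dst crosses."""
    if src.board != dst.board:
        return "inter_board"
    if src.die != dst.die:
        return "inter_die"
    return "intra_die"


def mean_miss_latency_ns(local_fraction: float) -> float:
    """Estimate mean miss latency when `local_fraction` of the misses are
    resolved on the board and the rest pay one inter-board hop.

    Simplifying assumption: every local miss costs the inter-die latency.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be within [0, 1]")
    local = local_fraction * ASSUMED_LATENCY_NS["inter_die"]
    remote = (1.0 - local_fraction) * ASSUMED_LATENCY_NS["inter_board"]
    return local + remote


if __name__ == "__main__":
    # Fractions reported in the abstract for 2-board and 8-board systems.
    for boards, fraction in ((2, 0.40), (8, 0.675)):
        estimate = mean_miss_latency_ns(fraction)
        print(f"{boards} boards: about {estimate:.0f} ns per miss (assumed model)")
```

Under these assumed constants, the inter-board term dominates the estimate, which is why the fraction of misses that can be resolved without leaving the board matters so much.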
The component that bridges internal (intra-board) and external (inter-board) communication is the bridge chip (called the HORUS chip in [3], the UV HUB in [4], and EMC² in [8]). Communication latency across the inter-board network can exceed one microsecond [9]. Since inter-board communication latency in these systems is extremely high compared to the other network latencies, avoiding this communication becomes a fundamental goal for delivering high performance.

2012 10th IEEE International Symposium on Parallel and Distributed Processing with Applications. 978-0-7695-4701-5/12 $26.00 © 2012 IEEE. DOI 10.1109/ISPA.2012.102

In this paper, we present a characterization of the cache misses that require coherence transactions among dies or boards. This characterization represents the first and fundamental step of a work in progress whose final goal is the design of a cache coherence protocol able to make the most of these hierarchical systems. In particular, we are interested