Cache Miss Characterization in Hierarchical
Large-Scale Cache-Coherent Systems
Alberto Ros∗, Blas Cuesta†§, María E. Gómez‡, Antonio Robles‡, José Duato‡
∗Departamento de Ingeniería y Tecnología de Computadores
Universidad de Murcia, 30100 Murcia (Spain)
E-mail: aros@ditec.um.es
†Intel Labs Barcelona
E-mail: blasx.cuesta@intel.com, blacuesa@gap.upv.es
‡Department of Computer Engineering
Universitat Politècnica de València, 46021 Valencia (Spain)
E-mail: {megomez,arobles,jduato}@gap.upv.es
Abstract—There is a growing trend towards developing large-
scale cache-coherent systems by using commodity symmetric
multiprocessors, which requires extending their coherence protocol.
In such systems, cache coherence transactions issued due
to cache misses traverse interconnection networks with very
different topologies and latencies. In this work, we perform a
cache miss characterization aimed at analyzing the benefits that
can be expected for a specialized coherence controller able to
locally resolve cache misses, thus saving traffic across long-latency
links. Results show that there is high potential for reducing miss
latency in these systems, and that this potential grows
as the number of nodes in the system increases. In particular, in
a system with just two boards, 40% of the cache misses do not
need the expensive inter-board communication. This percentage
can increase up to 67.5% for an 8-board system.
I. INTRODUCTION
Until recently, many service providers were able to use clusters
of PCs for high performance computing (HPC). Clusters of this
kind usually rely on message-passing communication
for remote memory accesses, which not only increases
communication latency but also makes developing
efficient applications harder than with the shared-memory
programming model. These drawbacks highlight the need for
large-scale cache-coherent systems.
There is a current trend towards developing such large-scale
cache-coherent systems from existing commodity
symmetric multiprocessors (SMP), which requires extending
their coherence protocol. AMD was the first to include such
features in their Opteron processors. In particular, the six- and
twelve-core versions of AMD Opteron processors, codenamed
Istanbul and Magny-Cours [1] respectively, can be interconnected
to form a larger system while still maintaining
cache coherence thanks to the Coherent HyperTransport (cHT)
technology [2]. Similarly, Intel's QuickPath Interconnect
(QPI) allows several Nehalem processors to form a larger
coherent system.
§This work was done before the author joined Intel, while at the
Universitat Politècnica de València.
To further increase the number of processor
cores that can be kept coherent in such systems, several
proposals aimed at extending the coherence domain
have appeared recently. We can find examples of these systems
either in the market (e.g., Horus [3] and SGI Altix UV [4])
or in the literature (e.g., EMC² [5], [6]). These hierarchical
systems have very different communication latencies among
processing cores depending on the distance, the interconnection
technology, and the level in the coherence hierarchy, as
we can see in Figure 1. The basic building block is the die,
which can comprise several processor cores (currently from 4 to
12). Communication among these cores is very fast (just a few
nanoseconds) and can be carried out over a shared bus. Several
dies can be placed on the same board to form
a larger system. Communication among dies is commonly
performed through a scalable point-to-point interconnect (e.g.,
cHT or QPI), and usually requires tens of nanoseconds [1].
Finally, several boards can be connected by an InfiniBand
[7] or Ethernet switch fabric. The component responsible for
managing communication between internal (intra-board) and
external (inter-board) messages is the bridge chip (called the
HORUS chip in [3], the UV HUB in [4], and EMC² in
[8]). Communication latency across the inter-board network
can be higher than one microsecond [9]. Since in these systems
the inter-board communication latency is extremely high
compared to the other network latencies, avoiding this
communication becomes a fundamental goal for delivering
high performance.
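To make the latency gap concrete, the following minimal sketch estimates the average miss latency when a given fraction of misses is resolved without crossing the inter-board network. It is an illustration only, not the model used in this work: the function name average_miss_latency and the two latency constants are assumptions, chosen solely to match the orders of magnitude quoted above (tens of nanoseconds among dies, about one microsecond across boards). The fractions 0.40 and 0.675 are the percentages reported in the abstract for 2-board and 8-board systems; the 0.0 case is a hypothetical baseline in which every miss pays the inter-board latency.

```python
# Assumed, illustrative latencies in nanoseconds. They are NOT measurements
# from the paper; they only match the orders of magnitude given in the text.
INTRA_BOARD_NS = 50.0    # "tens of nanoseconds" over cHT or QPI
INTER_BOARD_NS = 1000.0  # "higher than one microsecond" across boards


def average_miss_latency(local_fraction,
                         intra_ns=INTRA_BOARD_NS,
                         inter_ns=INTER_BOARD_NS):
    """Weighted average miss latency in nanoseconds.

    local_fraction is the share of misses resolved without inter-board
    communication; the remaining misses pay the inter-board latency.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be in [0, 1]")
    return local_fraction * intra_ns + (1.0 - local_fraction) * inter_ns


if __name__ == "__main__":
    # 0.0 is a hypothetical baseline; 0.40 and 0.675 come from the abstract.
    for fraction in (0.0, 0.40, 0.675):
        latency = average_miss_latency(fraction)
        print(f"local fraction {fraction:.3f}: {latency:.2f} ns on average")
```

Under these assumed latencies, the average is dominated by the misses that still cross boards, which is why the fraction of locally resolvable misses is the quantity of interest in the characterization.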
In this paper, we present a characterization of the cache
misses that require coherence transactions among dies or
boards. This characterization is the first and fundamental
step of a work in progress whose final goal is the
design of a cache coherence protocol able to make the most
of hierarchical systems. In particular, we are interested
2012 10th IEEE International Symposium on Parallel and Distributed Processing with Applications
978-0-7695-4701-5/12 $26.00 © 2012 IEEE
DOI 10.1109/ISPA.2012.102