A Two-Level Directory Architecture for Highly Scalable cc-NUMA Multiprocessors

Manuel E. Acacio, José González, Member, IEEE Computer Society, José M. García, Member, IEEE, and José Duato, Member, IEEE

Abstract—One important issue the designer of a scalable shared-memory multiprocessor must deal with is the amount of extra memory required to store the directory information. It is desirable that the directory memory overhead be kept as low as possible and that it scale very slowly with the size of the machine. Unfortunately, current directory architectures provide scalability at the expense of performance. This work presents a scalable directory architecture that significantly reduces the size of the directory for large-scale configurations of a multiprocessor without degrading performance. First, we propose multilayer clustering as an effective approach to reducing the width of directory entries. Based on this concept, we derive three new compressed sharing codes, some of them with a space complexity of O(log2(log2 N)) for an N-node system. Then, we present a novel two-level directory architecture to eliminate the penalty caused by compressed directories in general. The proposed organization consists of a small full-map first-level directory (which provides precise information for the most recently referenced lines) and a compressed second-level directory (which provides in-excess information for all the lines). The proposals are evaluated based on extensive execution-driven simulations (using RSIM) of a 64-node cc-NUMA multiprocessor. Results demonstrate that a system with a two-level directory architecture achieves the same performance as a multiprocessor with a big and nonscalable full-map directory, with a very significant reduction of the memory overhead.

Index Terms—Scalability, directory memory overhead, two-level directory architecture, compressed sharing codes, unnecessary coherence messages, cc-NUMA multiprocessor.
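The space-complexity claim in the abstract can be made concrete with a back-of-the-envelope comparison (illustrative only; the actual compressed sharing codes are developed later in the paper, and `compressed_bits` below models only their asymptotic width, not their exact encoding):

```python
from math import ceil, log2

def fullmap_bits(n_nodes):
    # Full-map sharing code: one presence bit per node, so each
    # directory entry is O(N) bits wide.
    return n_nodes

def compressed_bits(n_nodes):
    # Width of a hypothetical compressed sharing code with the
    # O(log2(log2 N)) space complexity claimed for some of the
    # multilayer-clustering codes (asymptotic sketch, not their format).
    return ceil(log2(log2(n_nodes)))

for n in (64, 256, 1024):
    print(f"{n:5d} nodes: full-map {fullmap_bits(n):4d} bits, "
          f"compressed ~{compressed_bits(n)} bits per directory entry")
```

Even at 64 nodes the full-map entry is more than an order of magnitude wider, and the gap grows almost linearly with machine size, which is the scalability problem the paper targets.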
1 INTRODUCTION

The key property of shared-memory multiprocessors is that communication occurs implicitly as a result of conventional memory access instructions (i.e., loads and stores), which makes them easier to program and, thus, preferred from a programmer's perspective over message-passing machines. Shared-memory multiprocessors cover a wide range of prices and features, from commodity SMPs to large high-performance cc-NUMA machines, such as the SGI Origin 2000/3000. Most shared-memory multiprocessors employ the cache hierarchy to reduce the time needed to access memory by keeping data values as close as possible to the processor that uses them. However, caching data values in a shared-memory multiprocessor introduces two major coherence problems, which are shown in Fig. 1. First, when multiple processors read the same location, they create shared copies of the memory line in their respective caches (see Fig. 1a). If, subsequently, the location is written, some action must be taken to ensure that the other processor caches do not supply stale data. In most cases, the cached copies are eliminated through invalidations (see Fig. 1b). After completing the write, the writing processor has a dirty copy of the cache line, which allows it to write the line subsequently by only updating its cached copy (see Fig. 1c). The second coherence problem arises when other processors reread this dirty line. When a line is dirty, simply reading a location may return a stale value from memory. To eliminate this problem, reads also require interaction with other processor caches. In this case, the cache that holds the requested line dirty provides a copy of its memory line, overriding the response from memory. At the same time, main memory is also updated (see Fig. 1d). Particular implementations of cache coherence protocols differ considerably depending on the total number of processors.
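The Fig. 1 scenarios can be sketched as a toy invalidation-based directory for a single memory line (illustrative only; a real protocol must also handle races, transient states, writebacks, and cache replacements, and the class name and methods here are our own, not the paper's):

```python
class LineDirectory:
    """Toy invalidation-based coherence state for one memory line,
    mirroring the scenarios of Fig. 1."""

    def __init__(self):
        self.sharers = set()  # caches holding a clean (shared) copy
        self.owner = None     # cache holding the line dirty, if any

    def read(self, cache):
        if self.owner is not None:
            # Line is dirty elsewhere: the owner supplies the data,
            # main memory is updated, and the owner's copy is
            # downgraded to shared (Fig. 1d).
            self.sharers = {self.owner, cache}
            self.owner = None
        else:
            # Shared copies simply accumulate (Fig. 1a).
            self.sharers.add(cache)

    def write(self, cache):
        # Invalidate every other cached copy (Fig. 1b) and grant the
        # writer an exclusive dirty copy (Fig. 1c).
        others = set(self.sharers)
        if self.owner is not None:
            others.add(self.owner)
        others.discard(cache)
        self.sharers = set()
        self.owner = cache
        return others  # caches that must receive invalidations

# Two reads create two sharers; a write by a third cache invalidates
# both; a later read forces the dirty owner to intervene.
d = LineDirectory()
d.read(0)
d.read(1)
invalidated = d.write(2)
d.read(3)
```

After this sequence, `invalidated` is `{0, 1}`, and the line ends up shared by caches 2 and 3 with no dirty owner.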
For systems with small processor counts, a common bus is usually employed along with snooping cache coherence protocols. Snooping protocols [1] solve the cache coherence problem by using a network with completely ordered message delivery (traditionally a bus) to broadcast coherence transactions directly to all processors and memory. Unfortunately, the broadcast medium becomes a bottleneck (due to both the limited bandwidth it provides and the limited number of processors that can be attached to it), preventing such systems from scaling. Instead, scalable shared-memory multiprocessors are built on scalable point-to-point interconnection networks, such as a mesh or a torus [2]. In addition, main memory is physically distributed to ensure that the bandwidth needed to access main memory scales with the number of processors.

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 16, NO. 1, JANUARY 2005

. M.E. Acacio and J.M. García are with the Departamento de Ingeniería y Tecnología de Computadores, Universidad de Murcia, Campus de Espinardo S/N, Facultad de Informática, 30071 Murcia, Spain. E-mail: {meacacio, jmgarcia}@ditec.um.es.
. J. González is with Intel Barcelona Research Center, Intel Labs Barcelona, C/ Jordi Girona 29, Edif. Nexus 2, Planta 3, 08034 Barcelona, Spain. E-mail: pepe.gonzalez@intel.com.
. J. Duato is with the Departamento de Informática de Sistemas y Computadores, Universidad Politécnica de Valencia, Camino de Vera S/N, 46010 Valencia, Spain. E-mail: jduato@gap.upv.es.

Manuscript received 30 June 2003; revised 24 Dec. 2003; accepted 24 July 2004; published online 23 Nov. 2004. For information on obtaining reprints of this article, please send e-mail to: tpds@computer.org, and reference IEEECS Log Number TPDS-0100-0603.

1045-9219/05/$20.00 © 2005 IEEE Published by the IEEE Computer Society
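With memory physically distributed, each line has a home node that keeps its directory entry, so directory storage and lookup bandwidth are spread across the machine. A minimal sketch of this organization (the round-robin interleaving policy and 128-byte line size are our assumptions for illustration, not the paper's or any particular machine's):

```python
LINE_SIZE = 128  # bytes per memory line (assumed for illustration)

def home_node(line_addr, n_nodes, line_size=LINE_SIZE):
    # Hypothetical low-order interleaving: consecutive memory lines
    # are distributed round-robin across the nodes. Real machines
    # use a variety of distribution policies.
    return (line_addr // line_size) % n_nodes

def directory_slice(addresses, node, n_nodes):
    # Each node keeps directory entries only for the lines it is
    # home to, so no single node's directory sees all traffic.
    return [a for a in addresses if home_node(a, n_nodes) == node]
```

For a 64-node machine, address 0 and address 8192 (line 64) map back to node 0, while address 128 maps to node 1; a coherence request is simply routed to `home_node(addr, n_nodes)` over the point-to-point network.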