DiCo-CMP: Efficient Cache Coherency in Tiled CMP Architectures

Alberto Ros, Manuel E. Acacio, José M. García
Departamento de Ingeniería y Tecnología de Computadores
Universidad de Murcia
Campus de Espinardo S/N, 30100 Murcia, Spain
{a.ros,meacacio,jmgarcia}@ditec.um.es

Abstract

Future CMP designs that will integrate tens of processor cores on-chip will be constrained by area and power. Area constraints make the use of a bus or a crossbar as the on-chip interconnection network impractical, and tiled CMPs organized around a direct interconnection network will probably be the architecture of choice. Power constraints make it impractical to rely on broadcasts (as Token-CMP does) or any other brute-force method for keeping cache coherence, and directory-based cache coherence protocols are currently being employed. Unfortunately, directory protocols introduce indirection to access directory information, which negatively impacts performance. In this work, we present DiCo-CMP, a novel cache coherence protocol especially suited to future tiled CMP architectures. In DiCo-CMP the role of storing up-to-date sharing information and ensuring totally ordered accesses for every memory block is assigned to the cache that must provide the block on a miss. Therefore, DiCo-CMP reduces the miss latency compared to a directory protocol by sending coherence messages directly from the requesting caches to those that must observe them (as would be done in brute-force protocols), and reduces the network traffic compared to Token-CMP (and consequently, power consumption in the interconnection network) by sending just one request message for each miss. Using an extended version of the GEMS simulator we show that DiCo-CMP achieves improvements in execution time of up to 8% on average over a directory protocol, and reductions in terms of network traffic of up to 42% on average compared to Token-CMP.

1. Introduction

The huge number of transistors currently offered in a single die has led major microprocessor vendors to shift towards multi-core architectures in which several processor cores are integrated on a single chip. Chip-multiprocessors (CMPs) [24] have important advantages over very wide-issue out-of-order superscalar processors. In particular, they provide higher aggregate computational power, multiple clock domains, better power efficiency, and simpler design through replicated building blocks.

Most current CMPs (for example, the IBM Power5 [13]) have a relatively small number of cores (2 to 8), each with at least one level of private cache. These cores are typically connected through an on-chip shared bus or crossbar. However, Moore’s Law is now expected to make it possible to double the number of cores every 18 months [7], which makes any element that could compromise the scalability of these designs undesirable. One such element is the interconnection network. As shown in [15], the area required by a shared bus or a crossbar grows with the number of cores to the point of becoming impractical. Tiled CMP architectures have recently emerged as a scalable alternative to current CMP designs, and future CMPs will probably be designed as arrays of replicated tiles connected over a switched direct network [28, 31].

On the other hand, most CMP systems provide programmers with the intuitive shared-memory model, which requires efficient support for cache coherency. Although a great deal of attention was devoted to scalable cache coherence protocols in the last decades in the context of shared-memory multiprocessors, the technological parameters and power constraints entailed by CMPs demand new solutions to the cache coherency problem [7].

Directory-based cache coherence protocols have typically been employed in systems with point-to-point unordered networks (such as tiled CMPs).
Unfortunately, these protocols introduce indirection to obtain coherence information from the directory (commonly on chip as a directory cache), thus increasing cache miss latencies. An alternative approach that avoids indirection is Token-CMP [22]. Token-CMP is based on broadcasting requests to all last-level private caches. In this way, caches can directly provide data when they receive a request (no indirection oc-