Wire Management for Coherence Traffic in Chip Multiprocessors

Liqun Cheng, Naveen Muralimanohar, Karthik Ramani, Rajeev Balasubramonian, John Carter
School of Computing, University of Utah *

* This work was supported in part by NSF grant CCF-0430063 and by Silicon Graphics Inc.

Abstract

Improvements in semiconductor technology have made it possible to include multiple processor cores on a single die. Chip Multi-Processors (CMP) are an attractive choice for future billion transistor architectures due to their low design complexity, high clock frequency, and high throughput. In a typical CMP architecture, the L2 cache is shared by multiple cores and data coherence is maintained among private L1s. Coherence operations entail frequent communication over global on-chip wires. In future technologies, communication between different L1s will have a significant impact on overall processor performance and power consumption.

On-chip wires can be designed to have different latency, bandwidth, and energy properties. Likewise, coherence protocol messages have different latency and bandwidth needs. We propose an interconnect comprised of wires with varying latency, bandwidth, and energy characteristics, and advocate intelligently mapping coherence operations to the appropriate wires. In this paper, we present a comprehensive list of techniques that allow coherence protocols to exploit a heterogeneous interconnect and present preliminary data that indicates the potential of these techniques to significantly improve performance and reduce power consumption. We further demonstrate that most of these techniques can be implemented at a minimum complexity overhead.

1. Introduction

Advances in process technology have led to the emergence of new bottlenecks in future microprocessors. One of the chief bottlenecks to performance is the high cost of on-chip communication through global wires [19].
Power consumption has also emerged as a first order design metric and wires contribute up to 50% of total chip power in some processors [28]. Future microprocessors are likely to exploit huge transistor budgets by employing a chip multi-processor (CMP) architecture [30, 32]. Multi-threaded workloads that execute on such processors will experience high on-chip communication latencies and will dissipate significant power in interconnects. In the past, the design of interconnects was primarily left up to VLSI and circuit designers. However, with communication emerging as a larger power and performance constraint than computation, architects may wish to consider different wire implementations and identify creative ways to exploit them [6]. This paper presents a number of creative ways in which coherence communication in a CMP can be mapped to different wire implementations with minor increases in complexity. We present preliminary results that demonstrate that such an approach can both improve performance and reduce power dissipation.

In a typical CMP, the L2 cache and lower levels of the memory hierarchy are shared by multiple cores [22, 32]. Sharing the L2 cache allows high cache utilization and avoids duplicating cache hardware resources. L1 caches are typically not shared as such an organization entails high communication latencies for every load and store. Maintaining coherence between the individual L1s is a challenge in CMP systems. There are two major mechanisms used to ensure coherence among L1s in a chip multiprocessor. The first option employs a bus connecting all of the L1s and a snoopy bus-based coherence protocol. In this design, every L1 cache miss results in a coherence message being broadcast on the global coherence bus. Individual L1 caches perform coherence operations on their local data in accordance with these coherence