An L2-Miss-Driven Early Register Deallocation for SMT Processors

Joseph Sharkey, Dmitry Ponomarev
Department of Computer Science, State University of New York, Binghamton, NY 13902-6000
{jsharke, dima}@cs.binghamton.edu

ABSTRACT

The register file is one of the most critical datapath components limiting the number of threads that can be supported on a Simultaneous Multithreading (SMT) processor. To allow the use of smaller register files without degrading performance, techniques that maximize the efficiency of register use through aggressive register allocation and deallocation can be considered. In this paper, we propose a novel technique for the early deallocation of physical registers allocated to threads that experience L2 cache misses. This is accomplished by speculatively committing the load-independent instructions and deallocating the registers corresponding to the previous mappings of their destinations, without waiting for the cache miss request to be serviced. The early-deallocated registers are then made immediately available for allocation to instructions within the same thread as well as within other threads, thus improving the overall processor throughput. On average, across the simulated mixes of multiprogrammed SPEC 2000 workloads, our technique results in a 33% improvement in throughput and a 25% improvement in the harmonic mean of weighted IPCs over the baseline SMT with the state-of-the-art DCRA policy. This is achieved without creating checkpoints, maintaining per-register counters of pending consumers, performing tag re-broadcasts, register re-mappings, or additional associative searches.

Categories and Subject Descriptors: C.1 [Processor Architectures]: Other Architecture Styles - Pipeline processors.

General Terms: Performance, Design

Keywords: Simultaneous Multithreading, Register Files
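The mechanism summarized in the abstract can be sketched in a few lines. This is a minimal, hypothetical model, not the paper's implementation: the `Instr` fields, the ROB walk, and the free list are illustrative assumptions. The idea is that while an L2 miss is outstanding, load-independent instructions at the head of the reorder buffer are speculatively committed, and the physical register holding the previous mapping of each destination is returned to the free list early.

```python
# Hypothetical sketch of L2-miss-driven early register deallocation.
# All names (Instr, rob, free_list) are illustrative, not from the paper.

from collections import deque

class Instr:
    def __init__(self, dest_phys, prev_phys, depends_on_miss):
        self.dest_phys = dest_phys            # physical reg written by this instruction
        self.prev_phys = prev_phys            # previous mapping of the same arch reg
        self.depends_on_miss = depends_on_miss
        self.executed = not depends_on_miss   # load-independent instrs completed execution

def early_deallocate(rob, free_list):
    """Walk the ROB in order; speculatively commit executed, load-independent
    instructions and release the registers holding their previous mappings."""
    while rob and rob[0].executed and not rob[0].depends_on_miss:
        instr = rob.popleft()
        free_list.append(instr.prev_phys)     # register made available early

# Two load-independent instructions followed by one that waits on the miss.
rob = deque([Instr(10, 3, False), Instr(11, 4, False), Instr(12, 5, True)])
free_list = []
early_deallocate(rob, free_list)
print(free_list)   # -> [3, 4]: previous mappings freed before the miss is serviced
```

The miss-dependent instruction stays in the ROB, so precise state can still be restored along the normal commit path once the miss returns.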
1. INTRODUCTION

Simultaneous Multithreading (SMT) is an important architectural paradigm for increasing processor throughput in an area-efficient manner by sharing the key datapath resources among the instructions from multiple threads [18, 25]. One such shared resource in an SMT datapath is the physical register file (RF), which must be sized very generously to support the full architectural state of each thread as well as to provide a sufficient number of additional registers for renaming, all within a common RAM structure. For example, for an ISA with 32 architectural registers, 128 registers are needed to maintain the precise state of a 4-threaded SMT, in each of the integer and floating-point RFs. When renaming registers are taken into account, the total number of entries within each RF can reach several hundred. The large access delays, high power consumption, and significant design complexity associated with such RFs are major factors limiting the number of simultaneous threads that can be supported by an SMT machine, especially in high-frequency implementations. Pipelining the access to large RFs over several cycles requires multiple levels of bypass and also lengthens the branch resolution and load-hit speculation loops [3].

An alternative to building large RFs is to use a smaller number of registers in a more efficient fashion. Higher efficiency of register utilization in an SMT processor can be achieved by addressing two related issues: 1) how to distribute the available registers among the threads, and 2) how to manage these registers in order to provide a larger supply for distribution. We generically refer to these two key aspects as register distribution and register management.

Register distribution. This issue is addressed in the recent literature through a series of proposals, such as I-Count [25], STALL [24], FLUSH [24], DCRA [5], and Hill-Climbing [7] techniques.
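The register-file sizing argument above can be made concrete with a quick back-of-the-envelope check. The renaming-register count below is an illustrative assumption; the text itself only says the total "can reach several hundred".

```python
# Back-of-the-envelope RF sizing for a 4-threaded SMT, per the introduction.
ARCH_REGS   = 32   # architectural registers per thread (per RF, int or FP)
THREADS     = 4    # 4-threaded SMT
RENAME_REGS = 128  # assumed pool of extra renaming registers (illustrative)

precise_state = ARCH_REGS * THREADS        # registers needed just for precise state
total_entries = precise_state + RENAME_REGS
print(precise_state, total_entries)        # -> 128 256
```

With 128 entries consumed by architectural state alone, even a modest renaming pool pushes each RF past 250 entries, which is what drives the access-delay and power concerns above.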
I-Count gives fetching (and thus register allocation) priority to threads with fewer not-yet-executed instructions. The FLUSH mechanism completely squashes a thread that experiences a long-latency L2 cache miss, releasing all physical registers allocated to this thread and assigning them to other threads while the cache miss is being serviced. The STALL mechanism simply blocks further resource allocations to such threads, without squashing the in-flight instructions. FLUSH generally provides higher performance than STALL [24], but it also incurs non-trivial overhead, because the squashed instructions always have to be re-fetched, re-scheduled, and re-executed, and all shared resources have to be reallocated to these instructions again. Consequently, the initial allocations, performed prior to the discovery of the cache miss, simply waste resources, even for the load-independent instructions that executed without problems. The DCRA policy takes a different approach and instead allocates more resources to memory-bound threads, attempting to help their performance. The Hill-Climbing mechanism further improves on DCRA by observing the impact of resource-distribution decisions at run time and feeding this information back to the front end of the pipeline to guide future allocations.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ICS'07, June 16-20, 2007, Seattle, Washington, USA.
Copyright 2007 ACM 978-1-59593-768-1/07/0006...$5.00.
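As a rough illustration, the I-Count policy described above reduces to a simple priority function: each cycle, fetch (and hence register allocation) goes to the thread with the fewest not-yet-executed instructions in flight. The function name and the example counts are hypothetical, not taken from the cited work.

```python
# Illustrative sketch of I-Count-style fetch prioritization.
def icount_pick(inflight_counts):
    """Return the id of the thread with the fewest in-flight
    (not-yet-executed) instructions; ties go to the lowest thread id."""
    return min(range(len(inflight_counts)), key=lambda t: inflight_counts[t])

# Thread 2 has only 3 unexecuted instructions, so it fetches next.
print(icount_pick([12, 7, 3, 9]))   # -> 2
```

This is why I-Count naturally steers resources away from threads that clog the pipeline behind a long-latency miss: their in-flight counts grow, lowering their fetch priority.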