Power Reduction in Superscalar Datapaths Through Dynamic Bit-Slice Activation *

Dmitry Ponomarev, Gurhan Kucuk, Kanad Ghose
Department of Computer Science
State University of New York, Binghamton, NY 13902-6000
e-mail: {dima, gurhan, ghose}@cs.binghamton.edu

* Supported in part by DARPA through contract number FC 306020020525 under the PAC-C program, by the IEEC at SUNY-Binghamton, and by the NSF through award nos. MIP 9504767 and EIA 9911099.

Abstract

We show, by simulating the execution of the SPEC 95 benchmarks on a true hardware-level, cycle-by-cycle simulator for a superscalar CPU, that about half of the bytes of operands flowing on the datapath, particularly the leading bytes, are all zeros. Furthermore, a significant number of the bits within the non-zero part of the data flowing on the various paths within the processor do not change from their prior value. We show how these two facts, attesting to the low entropy of the data streams, can be exploited to reduce power dissipation within all explicit and implicit storage components of a typical superscalar datapath, such as register files, dispatch buffers, and reorder buffers, as well as within interconnections such as buses and direct links. Our simulation results and SPICE measurements from representative VLSI layouts show power savings of about 25% on average over all SPEC 95 benchmarks.

1. Introduction

Contemporary superscalar datapath designs attempt to push the performance envelope by employing aggressive out-of-order instruction execution mechanisms, which entail the use of datapath artifacts such as dispatch buffers (DBs), large register files, and reorder buffers (ROBs) or their variants. In addition, multiple function units (FUs) and sizeable on-chip caches are frequently employed. Dispatch buffers and reorder buffers are generally implemented as multiported register files with additional logic, such as associative addressing facilities.
All of these components are implicit forms of storage within the datapath, while the register files (in addition to the on-chip caches) are an explicit form of storage. All of the implicit and explicit storage components in a modern superscalar datapath dissipate a considerable amount of energy [6, 11, 13]. While the absolute power requirements of high-end superscalar processors have risen steadily over the years as increasingly higher clock rates and smaller circuit components are used, the areal energy density (i.e., energy dissipated per unit area of the die) has become the immediate, serious concern [9]. Unless energy dissipation is controlled through technology-independent techniques, the areal energy density will soon become comparable to that of nuclear reactors, as shown in [9], leading to intermittent and permanent failures on the die and also creating serious challenges for cooling. Furthermore, the areal energy density distribution across a typical chip is highly skewed, being lower over the on-chip caches and significantly higher elsewhere. The non-uniform thermal stresses that result are also problematic.

This paper introduces a technology-independent, dynamic solution for reducing the energy dissipation in the implicit and explicit storage components in a manner that does not impede performance in any way. Better still, it reduces energy dissipation in the components that have the highest areal energy densities, such as DBs, ROBs, and register files. In addition, we introduce a related technique for reducing dissipation on the wires interconnecting these and other storage components. Specifically, we exploit the presence of bytes containing all zeros, particularly in the higher-order bytes of operand values that are read out from physical registers, issued to function units, forwarded from function units, or moved into the reorder and dispatch buffers.
We avoid the activation of byte slices that contain all zeros along the interconnections, and also within the implicit and explicit storage components of the processor, to conserve power. The simulation results show that, on average, about 50% of the bytes of operands are all zeros. Furthermore, within the non-zero bytes, more than 65% of the bits are identical to what was driven immediately before on the data flow path. For the purposes of this paper, a data stream is a sequence of operand values, possibly from different sources, that flows on an interconnection.

Exploiting the presence of bytes containing all zeros is not new. Zero bytes can be encoded to compact data and instructions. This fact was suggested and used to reduce the power dissipation in a dispatch buffer in [5]. In [12], the same fact is used to reduce energy dissipations within the primary data and instruction caches for SPECint95 and other