Distributing the Frontend for Temperature Reduction

Pedro Chaparro, Grigorios Magklis, José González and Antonio González
Intel Barcelona Research Center - Intel Labs – UPC
{pedro.chaparro.monferrer, grigoriosx.magklis, pepe.gonzalez, antoniox.gonzalez}@intel.com

Abstract

Due to increasing power densities, both on-chip average and peak temperatures are fast becoming a serious bottleneck in processor design. This is due to the cost of removing the heat generated, and the performance impact of dealing with thermal emergencies. So far microarchitectural techniques to control temperature have mainly focused on the processor backend (in particular the execution units), whereas the frontend has not received much attention. However, as the temperature of the backend remains controlled and the processor throughput increases, the heat dissipated by the frontend becomes more significant, and one of the major contributors to the total average temperature. This paper proposes and evaluates a distributed frontend for clustered microarchitectures that is able to reduce power density and temperature. First, a distributed mechanism for renaming and committing instructions is proposed. Second, a sub-banked trace cache with a bank hopping mechanism is presented. Finally, a method to improve the sub-banking is proposed based on a biased mapping function to distribute bank accesses to balance temperature.

1. Introduction

Power dissipation is one of the major hurdles in the design of next-generation microarchitectures. Power density is increasing in each generation due to the fact that frequency and leakage currents are scaling up very fast and their effect cannot be offset by decreasing the supply voltage. Power density directly translates into heat which must be removed from the processor die in order to keep the silicon temperature below a certain limit.
The increase in power density makes the cost of the cooling system grow and challenges the performance benefits that can be obtained from the ever-growing transistor density. For instance, the cooling system of a processor was traditionally designed to support the worst-case peak temperature. Because of the growing cost of cooling solutions and form-factor constraints (especially in mobile computers), the cooling system is now designed for the common case, and a thermal emergency mechanism is in charge of restoring the processor to its operating temperature. This solution has been adopted because the processor spends most of the time running at much lower temperatures than the worst-case scenario. Whenever a thermal emergency arises, a back-up mechanism to cool down the chip is triggered. Such mechanisms have a negative impact on performance.

The cost of the cooling system has been quantified on the order of $1-3 or more per Watt when the average power exceeds 40 Watts [4][14], which represents a significant part of the total cost of the chip. The cost of the heat removal system is especially important for data centers, where air conditioning is a main contributor to the total data center cost [22]. Furthermore, circuit reliability depends exponentially upon operating temperature. Temperature variations account for over 50% of electronic failures [28].

In order to reduce dynamic power dissipation, chip designers rely on scaling down the supply voltage. To counteract the negative effect of a lower supply voltage on gate delay, the threshold voltage is also scaled down along with the supply voltage. However, lowering the threshold voltage has a significant impact on leakage current due to the exponential relationship between them. In fact, it is expected that within a few process generations the contribution of leakage power to the total power will be comparable to that of dynamic power [4][9].
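
The trade-off described above between supply voltage, threshold voltage and leakage can be illustrated with a minimal sketch. It assumes the common first-order textbook relations (dynamic power proportional to C·V²·f, subthreshold leakage proportional to exp(-Vth / (n·kT/q))); these models, the function names and the slope factor of 1.5 are illustrative assumptions and are not taken from the paper.

```python
import math

# Physical constants (SI exact values).
BOLTZMANN = 1.380649e-23           # J/K
ELECTRON_CHARGE = 1.602176634e-19  # C


def thermal_voltage(temp_kelvin):
    """Thermal voltage kT/q in volts."""
    return BOLTZMANN * temp_kelvin / ELECTRON_CHARGE


def dynamic_power(activity, capacitance, vdd, frequency):
    """First-order switching power: a * C * Vdd^2 * f (watts)."""
    return activity * capacitance * vdd ** 2 * frequency


def relative_subthreshold_leakage(vth, temp_kelvin, slope_factor=1.5):
    """Unitless leakage factor exp(-Vth / (n * kT/q)).

    slope_factor (n) is a hypothetical, technology-dependent value;
    only ratios between two calls are meaningful.
    """
    return math.exp(-vth / (slope_factor * thermal_voltage(temp_kelvin)))


if __name__ == "__main__":
    # Lowering Vdd from 1.0 V to 0.9 V cuts dynamic power quadratically.
    p_hi = dynamic_power(0.5, 1e-9, 1.0, 1e9)
    p_lo = dynamic_power(0.5, 1e-9, 0.9, 1e9)
    print("dynamic power ratio:", p_lo / p_hi)

    # Lowering Vth by 100 mV to recover gate delay raises leakage exponentially.
    leak_ratio = (relative_subthreshold_leakage(0.2, 300.0)
                  / relative_subthreshold_leakage(0.3, 300.0))
    print("leakage ratio for -100 mV Vth:", leak_ratio)
```

Under these assumptions, a 10% reduction in supply voltage saves roughly 19% of dynamic power, while a 100 mV reduction in threshold voltage multiplies the leakage factor by about an order of magnitude. The same expression also grows with temperature at a fixed threshold voltage.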
It is also important to note that leakage power is exponentially dependent on temperature.

On the other hand, wire delays scale much more slowly than gate delays [1][3][21] and pose a serious obstacle to the scalability of superscalar processors. Clustered microarchitectures are an effective organization to deal with the problem of wire delays and complexity by means of partitioning some of the processor resources [6][11], such as the processor backend, and attempting to minimize the use of global (slow) communications. Clustered microprocessors achieve a significant reduction of the backend temperature due to an

Proceedings of the 11th Int’l Symposium on High-Performance Computer Architecture (HPCA-11 2005) 1530-0897/05 $20.00 © 2005 IEEE