Published in IET Computers & Digital Techniques
Received on 13th July 2010
Revised on 10th November 2010
doi: 10.1049/iet-cdt.2010.0102
ISSN 1751-8601

Exploring branch target buffer access filtering for low-energy and high-performance microarchitectures

S. Wang 1, J. Hu 2, S.G. Ziavras 3

1 National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, Jiang Su 210093, People’s Republic of China
2 Intel Corporation, Portland, OR 97124, USA
3 Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, NJ 07102, USA
E-mail: swang@nju.edu.cn

Abstract: Powerful branch predictors along with a large branch target buffer (BTB) are employed in superscalar and simultaneous multi-threading (SMT) processors for instruction-level parallelism and thread-level parallelism exploitation. However, the large BTB not only dominates the predictor energy consumption, but also becomes a major roadblock in achieving faster clock frequencies at deep sub-micron technologies. The authors propose here a filtering scheme to dramatically reduce the accesses to the BTB to achieve significantly reduced energy consumption in the BTB while maintaining the performance. For a simulated superscalar microprocessor, the experimental evaluation shows that the BTB access filtering (BAF) design achieves an 88.5% dynamic energy reduction with negligible performance loss. The authors also study the leakage behaviour and its control in the BAF design. The results show that by applying a drowsy strategy, very effective leakage control can be achieved. For the high-performance design, the BAF can also improve BTB’s performance scalability at new technologies. For the simultaneous multi-threading environment, the authors evaluate the effectiveness of the BAF design and propose a banked BAF (BK-BAF) scheme to further reduce the energy consumption and performance overhead.
The experimental results confirm that the BK-BAF scheme can be an energy/performance-effective design for next-generation SMT processors.

1 Introduction

Modern high-performance superscalar processor design is mainly driven by techniques exploiting high instruction-level parallelism (ILP) and faster clock frequencies with continuously advancing complementary metal-oxide semiconductor (CMOS) technology. Besides out-of-order issue/execution and register renaming, speculative execution is another major form of ILP exploitation. With correct branch predictions, the processor not only eliminates pipeline stalls due to control hazards, but also enables the datapath front end to supply sufficient instructions for ILP exploitation at later pipeline stages. However, a mispredicted branch requires flushing all instructions fetched along the speculated path from the pipeline and refilling it with new instructions from the resolved target address. Therefore, the branch misprediction penalty is recognised as a major performance limiter in speculative superscalar processors.

A highly accurate branch predictor is of critical importance in the design of ILP processors and has been the focus of tremendous research effort [2–5]. As direction predictors grow more sophisticated, a large BTB [6] is usually adopted to supply target addresses for predicted-taken branches, leading to non-trivial energy consumption in branch predictors [7, 8]. A recent study [8] shows that the branch prediction unit contributes a non-trivial share, about 7–10%, of the total processor energy consumption. Our experimental results show that a typical 2k-entry two-way set-associative BTB dissipates about 86% of the total branch prediction unit energy in a simulated Alpha 21364 processor.
For SMT processors, which target thread-level parallelism (TLP), the situation is even worse, since larger BTBs are needed to support speculative execution for multiple threads [9–13]. Consequently, energy optimisation of the BTB is becoming indispensable in the design of energy-aware processors at deep sub-micron technologies.

As the logic depth of a pipeline stage keeps shrinking [14] in deeply pipelined designs targeting higher clock frequencies, operations in many conventional large monolithic structures, such as the issue queue and register file, can no longer be completed within a single pipeline stage, eventually leading to reduced performance and significantly increased design complexity. Therefore, plenty of research has been devoted to exploring complexity-effective issue queue and register file designs, for example, [15–20] among many others. On the other hand, research on branch prediction has traditionally focused on direction prediction [2–5]. There is limited work on energy optimisation in branch predictors [7, 8, 21–23], and very little on the complexity and performance scalability of predictor designs. Moreover,

IET Comput. Digit. Tech., 2012, Vol. 6, Iss. 1, pp. 50–58 © The Institution of Engineering and Technology 2012 doi: 10.1049/iet-cdt.2010.0102 www.ietdl.org