Computational and Storage Power Optimizations for the O-GEHL Branch Predictor

Kaveh Aasaraai (aasaraai@ece.uvic.ca), Amirali Baniasadi (amirali@ece.uvic.ca), Ehsan Atoofian (eatoofia@ece.uvic.ca)
Electrical and Computer Engineering Department, University of Victoria, 3800 Finnerty Rd., Victoria, BC, Canada

ABSTRACT

In recent years, highly accurate branch predictors have been proposed, primarily for high-performance processors. Unfortunately, such predictors are extremely energy consuming and in some cases impractical, as they come with excessive prediction latency. One example of such predictors is the O-GEHL predictor. To achieve high accuracy, O-GEHL relies on large tables and extensive computations, and therefore requires high energy and a long prediction delay.

In this work we propose power optimization techniques that aim at reducing both the computational complexity and the storage size of the O-GEHL predictor. We show that by eliminating unnecessary data from computations, we can reduce both the predictor's energy consumption and its delay. Moreover, we apply findings from information theory to remove redundant storage without any significant accuracy penalty. We reduce the dynamic and static power dissipated in the computational parts of the predictor by up to 74% and 65%, respectively. Meanwhile, we improve performance by up to 12%, as we make faster prediction possible.

Categories and Subject Descriptors
C.1.0 [Computer Systems Organization]: Processor Architectures—General

General Terms
Design, Performance

Keywords
Power-Aware Microarchitectures, O-GEHL, Branch Prediction

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CF'07, May 7–9, 2007, Ischia, Italy.
Copyright 2007 ACM 978-1-59593-683-7/07/0005 ...$5.00.

1. INTRODUCTION

Perceptron-based predictors are highly accurate. Their high accuracy is the result of exploiting long history lengths [9] and is achieved at the expense of high complexity.

The Optimized GEometric History Length (or simply O-GEHL) predictor is an example of a perceptron-like predictor. O-GEHL relies on exploiting behavioral correlations among branch instructions. To collect and store as much information as possible, the O-GEHL branch predictor uses multiple tables equipped with wide counters. The predictor uses the collected data and performs several steps before making the prediction. These steps include reading several counters from the tables and performing several computations (e.g., additions and comparisons) on the collected data.

In this work we revisit the O-GEHL predictor and show that while the conventional scheme provides high prediction accuracy, it is not efficient from the energy point of view. We are motivated by the following observations. First, our study shows that not all the computations performed by O-GEHL are necessary. This is particularly true for computations performed on the counters' lower bits. As we show later, not all counter bits always impact the prediction outcome. Therefore, excluding the less important bits from the computations can reduce energy consumption without necessarily impacting accuracy. Second, we have observed that the tables used by O-GEHL store redundant data. We show that the stored data can be represented using less storage if this redundancy is taken into account.

We rely on the above observations and introduce two power optimization techniques. Our techniques aim at reducing the power dissipated by the computation and storage resources. We reduce power for the computation resources by eliminating unnecessary and redundant counter bits from computations and by accessing and using fewer bits at prediction time.
We reduce power for the storage resources by representing the required data using fewer bits. We achieve this by having multiple counters share their lower bits. We show that with intelligent bit sharing it is possible to reduce the predictor size while maintaining its accuracy. It should be noted that since our optimizations are not performed dynamically, they come with no latency or power overhead at runtime.

By applying our techniques we not only reduce power but also improve processor performance. This is due to the fact that faster prediction becomes possible.