Complex Branch Profiling for Dynamic Conditional Execution

Rafael R. dos Santos¹, Tatiana G. S. dos Santos¹, Mauricio L. Pilla¹, Philippe O. A. Navaux¹, Sergio Bampi¹, Mario Nemirovsky²

¹ Universidade Federal do Rio Grande do Sul, Porto Alegre, RS, Brazil
[rrsantos,tatiana,pilla,navaux,bampi]@inf.ufrgs.br
² Kayamba, San Jose, CA, USA
mario@ieee.org

Abstract— Branch predictors are widely used to deal with conditional branches. Despite their high accuracy, misprediction penalties are still large in any superscalar pipeline. DCE, or Dynamic Conditional Execution, is an alternative that executes both paths of certain branches, thus reducing the number of predicted branches and therefore the number of mispredictions. The goal of this work is to analyze the complexity of branch structures and to determine the number of branches that can be predicated in DCE and the distribution of mispredictions according to the proposed classification. The proposed complex branch classification extends the classification presented by Klauser [KLA98]. As a result, it is shown that an average of 35% of all branches can be predicated in DCE and that around 32% of mispredictions fall into these branches.

Keywords— Superscalar architectures, Multipath execution, Branch prediction, Dynamic predication

I. INTRODUCTION

Conditional branches pose a challenge to increasing performance in current processors. Many mechanisms have been proposed to mitigate their impact, such as branch prediction, speculative execution, trace caches, instruction prefetching, and multipath execution. These mechanisms rely on fetching more instructions to feed the starving functional units [UHT95, TYS97, HEI96, KLA98, SAN98, SKA99]. The most common technique, implemented in all modern processors, is branch prediction.
Branch prediction is widely used to predict the next instruction fetch address when conditional and/or unconditional branches are fetched. Nevertheless, despite the accuracy of current predictors, mispredictions still degrade performance considerably.

Superscalar pipelines are getting deeper to support increasing demands for clock frequency. Therefore, as more stages are added, the penalty imposed by branch mispredictions increases. The accuracy of state-of-the-art branch predictors, however, is not improving. Increasing the accuracy of current predictors implies an extraordinary increase in complexity. Since the predictor now has to be split among several fetch stages, many levels of prediction are necessary, again because of the high clock frequencies required.

Grants from CNPq, CAPES and UNISC

One can avoid branch predictions by executing both paths of a conditional branch. Multipath execution, for instance, was extensively studied in the past, but simply executing all paths of all branches proved not to be efficient. DCE [SAN01, SAN03] presents an alternative in which only certain conditional branches are predicated. DCE can perform dynamic predication of simple and complex conditional branches without requiring a special instruction set or special compiler optimizations; hence it can be applied to legacy code.

This paper presents an analysis of the behavior and patterns of direct conditional branches in order to better understand the dynamics of control structures and to quantify the number of branches that can be predicated. First, DCE is briefly presented in Section II. Then, Section III introduces a classification for hammocks that extends the classification presented by Klauser et al. [KLA98]. The simulation environment is presented in Section IV, and the results are discussed in Section V. Conclusions are drawn in the last section.

II. DYNAMIC CONDITIONAL EXECUTION

DCE combines dynamic predication and multipath execution to reduce the complexity and disruptions of the fetch. This is achieved by fetching sequentially through branches that qualify for predication.

To determine whether a branch qualifies for predication, an extension of the selection mechanism proposed in [KLA98] was developed. In that selection mechanism, only simple branches qualify for predication. The selection scheme used in DCE also qualifies complex branches.

The selection mechanism is static and runs at compilation time, marking branches that can be predicated according to the target locality. The compiler does not change the