DRUT: An Efficient Turbo Boost Solution via Load Balancing in Decoupled Look-ahead Architecture

Raj Parihar
Tensilica R&D and IP Group, Cadence Design Systems, San Jose, CA
parihar@cadence.com

Michael C. Huang
Dept. of Electrical & Computer Engineering, University of Rochester, Rochester, NY
michael.huang@rochester.edu

Abstract

In spite of the multicore revolution, high single-thread performance still plays an important role in ensuring a decent overall gain. Look-ahead is a proven strategy for uncovering implicit parallelism; however, a conventional out-of-order core quickly becomes resource-inefficient when looking beyond a short distance. An effective approach is to use an independent look-ahead thread running on a separate context, guided by a program slice known as the skeleton. We observe that fixed heuristics to generate skeletons are often suboptimal. As a consequence, the look-ahead agent cannot target enough bottlenecks to reap all the benefits it should.

In this paper, we present DRUT, a holistic hardware-software solution that achieves good single-thread performance by tuning the look-ahead skeleton efficiently. First, we propose a number of dynamic transformations to branch-based code modules (we call them Do-It-Yourself, or DIY) that enable a faster look-ahead thread without compromising the quality of the look-ahead. Second, we extend our tuning mechanism to arbitrary code regions and use a profile-driven technique to tune the skeleton for the whole program.

Assisted by the aforementioned techniques, the look-ahead thread improves the performance of a baseline decoupled look-ahead system by up to 1.93× with a geometric mean of 1.15×. Our techniques, combined with the weak-dependence removal technique, improve the performance of a baseline look-ahead system by up to 2.12× with a geometric mean of 1.20×.
This is an impressive performance gain of 1.61× over the single-thread baseline, which is much better than conventional Turbo Boost at a comparable energy budget.

Keywords: Implicit parallelism, Turbo Boost, Decoupled look-ahead, Do-It-Yourself branches, Skeleton tuning

1 Introduction

For the past decade or so, mainstream microprocessors have been increasing the number of cores packaged in a single chip. The trend is more pronounced in server chips than in those used in desktops or mobile devices [23, 34, 55, 57]. The increase in core count directly improves system throughput when there is enough explicit parallelism in the workload. For a single program, the case is less obvious: some programs are easier to parallelize, others are more difficult. New tools and programming models constantly emerge to make the task easier [4, 13, 27, 50, 54, 62]. Despite these endeavors, parallel codes, especially efficient ones, take more effort to write, debug, and maintain. Even when a program has already been parallelized, its efficiency tends to drop due to increased synchronization overhead as the number of threads increases [65]. In other words, increasing the number of cores does not necessarily translate into increased performance, and significant tuning effort is often required to realize the available potential.

Multithreaded execution [1, 14, 31, 36, 39, 43, 49, 53, 61, 63, 64], special-purpose accelerators [17, 30], and increasing the performance of a single thread [25, 26, 45, 52] are all part of the arsenal for improving program execution speed in an energy-efficient manner. A primary appeal of exploiting implicit parallelism is its broad applicability to all kinds of programs. For a long time, increasing single-thread performance has been the central focus of the microprocessor industry and, to a lesser extent, of the related research community.
With the significant slowdown in both processor clock scaling and microarchitectural improvement, single-thread performance now lacks its two traditional drivers. Yet it remains a key processor design goal, as it offers across-the-board benefits.

Recent commercial microarchitectures already employ large reorder buffers, wider pipelines, and clustered resources to sustain high single-thread performance [5, 10, 23, 24, 28, 29, 32–35, 37]. Soon it will be hard for a monolithic core to keep pace with growing demands. Fortunately, the major challenges in single-thread performance are not due to a lack of potential. Indeed, there is significant implicit parallelism both in the old benchmarks used in classic studies and in their modern counterparts [22, 47]. Current systems are far from exhausting this implicit parallelism, and exploring the various opportunities in this domain is a worthwhile pursuit.

One effective approach to achieving the next big jump in (much needed) single-thread performance is smarter look-ahead via better decoupling of resources. Executing dedicated code that runs ahead of the main program to help extract implicit parallelism is a promising technique that deserves further attention. Given the proliferation and ubiquity of multi-core architectures, a more decoupled look-ahead architecture is a serious candidate for performance boosting. While the decoupled look-ahead technique has shown significant performance benefits, the look-ahead thread itself has often become the new bottleneck [22, 47]. Fortunately, without hard correctness constraints, there are more opportunities than challenges in solving this problem.

In this paper, we propose a holistic hardware-software solution (known as DRUT 1) to improve the speed of the look-ahead thread without compromising the quality of the look-ahead.
DRUT not only improves overall performance but also brings significant savings in energy and power consumption, making decoupled look-ahead a very attractive alternative to traditional turbo boosting based on frequency scaling.

1 Dynamic Removal of Unwanted code and Tuning.