Libra: Tailoring SIMD Execution using Heterogeneous Hardware and Dynamic Configurability *

Yongjun Park, Jason Jong Kyu Park, Hyunchul Park, Scott Mahlke
Advanced Computer Architecture Laboratory
University of Michigan, Ann Arbor, MI, USA
{yjunpark, jasonjk, parkhc, mahlke}@umich.edu

* To appear in the 45th International Symposium on Microarchitecture (2012).
Currently with Programming Systems Lab, Intel Labs, Santa Clara, CA

Abstract

Mobile computing, as exemplified by the smart phone, has become an integral part of our daily lives. The next generation of these devices will be driven by providing an even richer user experience and compelling capabilities: higher definition multimedia, 3D graphics, augmented reality, games, and voice interfaces. To address these goals, the core computing capabilities of the smart phone must be scaled. However, energy budgets are increasing at a much lower rate, requiring fundamental improvements in computing efficiency. SIMD accelerators offer the combination of high performance and low energy consumption through low control and interconnect overhead. However, SIMD accelerators are not a panacea. Many applications lack sufficient vector parallelism to effectively utilize a large number of SIMD lanes. Further, the use of symmetric hardware lanes leads to low utilization and high static power dissipation as SIMD width is scaled. To address these inefficiencies, this paper focuses on breaking two traditional rules of SIMD processing: homogeneity and static configuration. The Libra accelerator increases SIMD utility by blurring the divide between vector and instruction parallelism to support efficient execution of a wider range of loops, and it increases hardware utilization through the use of heterogeneous hardware across the SIMD lanes. Experimental results show that the 32-lane Libra outperforms traditional SIMD accelerators by an average of 1.58x due to higher loop coverage, while consuming 29% less energy through heterogeneous hardware.

1. Introduction

The mobile device market, including cell phones, netbooks, and personal digital assistants, is one of the most highly competitive businesses. The computing platforms that go into these devices must provide ever increasing performance capabilities while maintaining low energy consumption in order to support advanced multimedia and signal processing applications. Application-specific integrated circuits (ASICs) are the most common solutions for meeting these requirements, performing the most compute-intensive kernels in a high performance but energy-efficient manner. However, several factors push designers toward a more flexible and programmable solution: supporting multiple applications or variations of applications, providing faster time-to-market, and enabling algorithmic changes after the hardware is constructed.

Processors that exploit instruction-level parallelism (ILP) provide the highest degree of computing flexibility. Modern smart phones employ a 1 GHz dual-issue superscalar ARM as an application processor. Higher performance digital signal processors are also available, such as the 8-issue TI C6x. However, ILP processors have scalability limits, including many-ported register files (RFs) and complex interconnects. Alternatively, single-instruction multiple-data (SIMD) accelerators provide high efficiency because of their regular structure, ability to scale lanes, and low control logic overhead. They have long been used in the desktop space for high performance multimedia and graphics functionality, but their combination of scalable performance, energy efficiency, and programmability also makes them ideal for mobile systems [24, 9, 15, 27].

In order to fully utilize the SIMD hardware, it is necessary for the programmer or compiler to extract sufficient data-level parallelism (DLP).
Automatic loop vectorization is available in a variety of commercial compilers, including offerings from Intel, IBM, and PGI. Classic scientific computing codes (regular structure, large trip count loops, and few data dependences) are naturally well matched to SIMD accelerators. But, in many respects, the mobile terminal has become a general-purpose computer. Thus, like the desktop, only a small percentage of mobile applications look like classic scientific computing. The computation is not dominated by simple vectorizable loops, but by loops containing significant numbers of control and data dependences to handle the complexity of modern multimedia standards. As a result, applications have varying amounts of vector parallelism, ranging from none to some to large amounts. The net effect is that SIMD hardware goes unused for a large fraction of application execution and thus cannot be counted on to provide significant performance gains.

A second but inter-related problem with SIMD computing is low hardware utilization even when vector loops are executed. The use of homogeneous hardware (e.g., identical lanes) is one of the main advantages of SIMD datapaths because it reduces design cost and complexity. But the utilization of the most complex components of a SIMD lane is often disproportionately lower than that of the simple components. For example, the H.264 video decoding application is dominated by simple integer operations (adds, subtracts, shifts), and on average only 2.2% and 1.3% of the total dynamic instructions are multiplies and divides, respectively [8]. This is not an outlying data point: most multimedia and visual computing applications have small fractions of multiply, divide, and other expensive operators. For 128-bit SIMD (4 lanes), such utilization rates may not matter, but as SIMD widths are scaled to 1024 bits (32 lanes) or more to increase performance, the problem becomes serious due to poor area utilization and high static power dissipation.
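
The scaling argument above can be made concrete with a small back-of-the-envelope model. The sketch below is not taken from the paper. It assumes that every lane carries its own multiplier and divider, that one vector instruction issues per cycle with all lanes active, and that a unit is occupied for a single cycle per use. The function name idle_units is hypothetical. Under those assumptions a unit's utilization equals the fraction of dynamic instructions that need it, so the average number of idle units grows linearly with the lane count.

```python
def idle_units(lanes: int, op_fraction: float) -> float:
    """Average number of per-lane units of one kind that sit idle each cycle.

    Assumed model (not from the paper): one unit of this kind per lane,
    one vector instruction per cycle, all lanes active, single-cycle occupancy.
    """
    if lanes <= 0:
        raise ValueError("lanes must be positive")
    if not 0.0 <= op_fraction <= 1.0:
        raise ValueError("op_fraction must be within [0, 1]")
    # Each unit is busy for op_fraction of the cycles and idle otherwise.
    return lanes * (1.0 - op_fraction)


# Dynamic instruction fractions quoted in the text for H.264 decoding.
MUL_FRACTION = 0.022
DIV_FRACTION = 0.013

if __name__ == "__main__":
    for lanes in (4, 32):
        print(
            f"{lanes:2d} lanes: "
            f"{idle_units(lanes, MUL_FRACTION):.2f} idle multipliers, "
            f"{idle_units(lanes, DIV_FRACTION):.2f} idle dividers per cycle"
        )
```

With the H.264 figures quoted above, this model gives roughly 3.9 idle multipliers per cycle at 4 lanes and roughly 31.3 at 32 lanes. The percentage utilization is the same in both cases, but the absolute amount of idle hardware, and with it the static power spent on that hardware, is eight times larger.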

To attack these problems, we propose a customizable SIMD accelerator, referred to as Libra, that is capable of tailoring its execution strategy to the running application. Libra employs two key concepts, heterogeneity and dynamic configurability, to achieve broader applicability and better energy efficiency than traditional SIMD accelerators. Heterogeneity allows lanes to have different functionali-