VEAL: Virtualized Execution Accelerator for Loops

Nathan Clark¹, Amir Hormati², and Scott Mahlke²
¹College of Computing, Georgia Institute of Technology
²Advanced Computer Architecture Laboratory, University of Michigan - Ann Arbor
ntclark@cc.gatech.edu, {hormati, mahlke}@umich.edu

Abstract

Performance improvement solely through transistor scaling is becoming more and more difficult, thus it is increasingly common to see domain-specific accelerators used in conjunction with general-purpose processors to achieve future performance goals. There is a serious drawback to accelerators, though: binary compatibility. An application compiled to utilize an accelerator cannot run on a processor without that accelerator, and applications that do not utilize an accelerator will never use it. To overcome this problem, we propose decoupling the instruction set architecture from the underlying accelerators. Computation to be accelerated is expressed using a processor's baseline instruction set, and light-weight dynamic translation maps the representation to whatever accelerators are available in the system.

In this paper, we describe the changes to a compilation framework and processor system needed to support this abstraction for an important set of accelerator designs that support innermost loops. In this analysis, we investigate the dynamic overheads associated with abstraction as well as the static/dynamic tradeoffs to improve the dynamic mapping of loop nests. As part of the exploration, we also provide a quantitative analysis of the hardware characteristics of effective loop accelerators. We conclude that using a hybrid static-dynamic compilation approach to map computation onto loop-level accelerators is a practical way to increase computation efficiency, without the overheads associated with instruction set modification.
1 Introduction

For decades, industry has produced, and consumers have relied on, exponential performance improvements from microprocessor systems. This continual performance improvement has enabled many applications, such as real-time ray tracing, that would have been computationally infeasible only a few years ago. Despite these advances, many compelling application domains remain beyond the scope of everyday computer systems, and so the quest for more performance remains an active research goal.

The traditional method of performance improvement, through increased clock frequency, has fallen by the wayside as the increased power consumption now outweighs any performance benefits. This development has spurred a great deal of recent research in the area of multicore systems: trying to provide efficient performance improvements through increased parallelism.

Not all applications are well suited for multicore environments, though. In these situations, an increasingly popular way to provide more performance is through customized hardware. Adding application-specific integrated circuits (ASICs) or application-specific instruction set processors (ASIPs) to a general-purpose design provides not only significant performance improvements, but also major reductions in power consumption. There are many examples of customized hardware being effectively used as part of a system-on-chip (SoC) in industry, for example the encryption coprocessor in Sun's UltraSPARC T2 [23].

The main drawback of this approach is that creating specialized hardware accelerators for each targeted application carries significant costs. Hardware design and verification effort, software porting, and fabrication challenges all contribute to the substantial non-recurring engineering costs associated with adding new accelerators.
Purchasing accelerator designs, in the form of intellectual property (IP), is a popular option to alleviate some of the hardware design costs, but there are still significant integration costs (both hardware and software) in tying the IP accelerators into the rest of the system.

The goal of this work is to attack those costs. First, we present the design of a hardware accelerator that effectively executes a class of loop bodies for a range of applications. Many applications spend the majority of their time executing in innermost loops, and so ASICs tend to implement one or more loop bodies. By defining a single architecture to accelerate loops, the recurring costs of designing an application-specific accelerator are eliminated. The goal is to cost-effectively generalize an ASIC design to make it useful for a wider range of loops, without generalizing it to the point where it begins to look like a general-purpose processor.

The second step is to attack the software costs of targeting an accelerator. Software costs primarily result from re-engineering the application once the underlying hardware has changed. To avoid these costs, we develop a software abstraction that virtualizes the salient architectural features of loop accelerators (henceforth abbreviated LAs). An application that uses this abstraction is dynamically retargeted to take advantage of the accelerator if it is available in the system; however, the application will still execute correctly without any accelerator in the system. The tradeoff is to abstract away as many architecture-specific features as possible without requiring a significant overhead to dynamically retarget the application.

The resulting design is referred to as VEAL, or Virtualized Execution Accelerator for Loops. There are two primary contributions of this work:

• It presents the design of a novel loop accelerator architecture.
Quantitative design space exploration ensures that the accelerator design is broad enough to accelerate many different applications, yet very efficient at executing the targeted style of computation.

• It describes a dynamic algorithm for mapping loops onto loop accelerators. The algorithm is analyzed to determine the runtime overheads introduced by dynamically mapping loops, and static/dynamic tradeoffs are investigated to mitigate this overhead.

2 Overview

It is widely acknowledged that the vast majority of execution time for most applications is spent in loops. Applying this fact, along with Amdahl's Law, often leads system designers to construct hardware implementing loop bodies whenever