Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Kevin Fan, Manjunath Kudlur, Hyunchul Park, Scott Mahlke
Advanced Computer Architecture Laboratory
University of Michigan
Ann Arbor, MI 48109
{fank, kvman, parkhc, mahlke}@umich.edu

ABSTRACT

Scheduling algorithms used in compilers traditionally focus on goals such as reducing schedule length and register pressure or producing compact code. In the context of a hardware synthesis system where the schedule is used to determine various components of the hardware, including datapath, storage, and interconnect, the goals of a scheduler change drastically. In addition to achieving the traditional goals, the scheduler must proactively make decisions to ensure efficient hardware is produced. This paper proposes two exact solutions for cost sensitive modulo scheduling, one based on an integer linear programming formulation and another based on branch-and-bound search. To achieve reasonable compilation times, decomposition techniques to break down the complex scheduling problem into phase ordered sub-problems are proposed. The decomposition techniques work either by partitioning the dataflow graph into smaller subgraphs and optimally scheduling the subgraphs, or by splitting the scheduling problem into two phases, time slot and resource assignment. The effectiveness of cost sensitive modulo scheduling in minimizing the costs of function units, register structures, and interconnection wires is evaluated within a fully automatic synthesis system for loop accelerators. The cost sensitive modulo scheduler increases the efficiency of the resulting hardware significantly compared to both traditional cost unaware and greedy cost aware modulo schedulers.

1. INTRODUCTION

The markets for cellular phones, portable digital assistants, digital cameras, and other special-purpose devices continue to grow explosively.
The embedded computing systems that go into these devices must meet the demands of higher performance and greater energy efficiency to support new functionality, added capabilities, more flexibility, and higher bandwidth communication. To achieve these challenging goals, application-specific hardware in the form of loop accelerators is commonly used to execute the compute-intensive portions of applications that would run too slowly if implemented in software on a programmable processor. Low cost, high performance, systematic verification, and short time-to-market are all critical objectives for designing these accelerators. Automatic synthesis technology to build loop accelerators from high-level specifications is critical to achieving these objectives.

A key challenge with automatic synthesis is creating efficient designs. Efficiency can be defined along many axes, including performance, cost, and energy. For this work, the focus is on cost efficiency, so the objective is to design the lowest cost accelerator that meets a specified performance level. Cost-efficient accelerators are synthesized by optimizing the design in a number of ways. First, hardware structures are sized just large enough to meet the precision requirements of the application. Second, storage structures (memories, registers, etc.) are given just enough entries to meet the worst-case requirements of the application. Third, hardware can be shared by time multiplexing hardware components when either the hardware is required under disjoint conditions or the performance of dedicated hardware is not necessary. In addition to the hardware components, interconnect can also be optimized using the same strategies. A manual designer is typically proactive in organizing the design to maximize the savings of these general approaches and balance tradeoffs between component and interconnect cost.

This work examines the construction of a loop accelerator synthesis system.
The proposed system utilizes a compiler-directed approach for designing accelerators that was derived from the PICO-NPA (Program-In Chip-Out Non-Programmable Accelerator) system [28]. The inputs to the system are a target loop nest expressed in C, the desired throughput, and the available memory bandwidth. Synthesis is divided into three steps. First, a simple, single-cluster VLIW processor is designed to meet the throughput requirements of the application. The simple processor consists of a set of arbitrary function units, connected to a centralized register file with unlimited entries and an unbounded memory. It provides an abstract target to which the compiler can efficiently map algorithms. Next, modulo scheduling is performed to map the application onto the simple processor [27]. Finally, a stylized loop accelerator is synthesized from the resulting schedule.

The critical portion of the synthesis system is the modulo scheduler. A traditional modulo scheduler attempts to map a loop onto a fixed hardware configuration, optimizing the throughput, number of pipeline stages, and possibly the lifetimes of registers. In our system, the resulting schedule of operations is used to determine the complete architecture of the accelerator, including the control path, computation elements, storage structures, and interconnect. Thus, the scheduling objectives are completely changed. The scheduler must make binding decisions that lead to the most efficient design. Hence, cost sensitive modulo scheduling is proposed.

The objective of cost sensitive modulo scheduling is to create a schedule that not only achieves a specified throughput, but also yields the lowest cost accelerator design. To accomplish this objective, the accelerator design is modeled during scheduling, so the impact of binding decisions on cost can be assessed.
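
To make the idea of assessing a binding decision concrete, the following is a minimal sketch, not the cost model used in this work. It assumes a function unit's cost is simply the sum of the costs of the opcodes it must support, and that an operation scheduled at time t occupies slot t mod II of that unit, where II is the initiation interval implied by the target throughput. The names `OP_COST`, `FunctionUnit`, `binding_cost`, and `bind`, and all cost values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical per-opcode hardware costs in arbitrary units (an assumption,
# not data from the paper).
OP_COST = {"add": 1.0, "mul": 8.0, "shift": 0.5}


@dataclass
class FunctionUnit:
    # Opcodes this unit must implement given the bindings made so far.
    ops: set = field(default_factory=set)
    # One row of a modulo reservation table: slot (time mod II) -> operation name.
    busy: dict = field(default_factory=dict)

    def cost(self):
        # Assumed cost model: sum of the costs of all supported opcodes.
        return sum(OP_COST[o] for o in self.ops)


def binding_cost(fu, opcode, time, ii):
    """Incremental cost of binding `opcode` to `fu` at `time`.

    Returns None when the modulo slot is already occupied, i.e. the binding
    would conflict with an operation from another loop iteration.
    """
    slot = time % ii
    if slot in fu.busy:
        return None
    # Reusing an opcode the unit already supports adds no hardware.
    return 0.0 if opcode in fu.ops else OP_COST[opcode]


def bind(fu, name, opcode, time, ii):
    """Commit the binding: reserve the modulo slot and record the opcode."""
    fu.busy[time % ii] = name
    fu.ops.add(opcode)


# Usage: with II = 2, a multiply is first bound to fu0 at time 0.
ii = 2
fu0, fu1 = FunctionUnit(), FunctionUnit()
bind(fu0, "m1", "mul", 0, ii)

conflict = binding_cost(fu0, "mul", 2, ii)  # slot 2 mod 2 = 0 is taken: None
shared = binding_cost(fu0, "mul", 3, ii)    # reuses fu0's multiplier: 0.0
fresh = binding_cost(fu1, "mul", 3, ii)     # needs a second multiplier: 8.0
```

In this toy model, sharing the existing multiplier is free while binding to an empty unit incurs the full multiplier cost, which is the kind of difference a cost sensitive scheduler can weigh when choosing among alternatives. Register and interconnect costs are omitted here for brevity.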
Our first approach to this problem utilized a greedy strategy, wherein at each scheduling step, the alternative that produced the least cost increase to the current design was chosen. The greedy approach was generally better than the baseline cost insensitive scheduler, but not by a large amount. The scheduler got trapped in too many local minima