Enabling Multithreading on CGRAs

Aviral Shrivastava, Jared Pager, Reiley Jeyapaul, Mahdi Hamzeh, and Sarma Vrudhula
Compiler Microarchitecture Lab, VLSI Electronic Design Automation Laboratory
Arizona State University, Tempe, AZ, USA
Email: {aviral.shrivastava, jppager, reiley.jeyapaul, mahdi.hamzeh, vrudhula}@asu.edu

Abstract—Coarse-Grained Reconfigurable Arrays or CGRAs are programmable fabrics that promise both high performance and high power efficiency. Traditionally, CGRAs were used to accelerate extremely-embedded systems, and were typically manually programmed. However, as CGRAs are conceived to be used as more general-purpose accelerators, there is a need to develop software tools and capabilities. Much work has been done on developing compiler techniques for CGRAs, making programming them easier; however, there is no support for multithreading. As an accelerator to a multithreaded processor, a CGRA is currently restricted to accelerating only one kernel of one thread running on the processor at any point in time. Supporting multithreading is difficult, since the start times and end times of threads are dynamic in nature, while CGRAs are statically scheduled. In this paper, we propose a strategy to support multithreading on a CGRA. The chief capability that we develop is a scheme to quickly transform an existing application mapping that uses the entire CGRA into one that uses only a fraction of it. Our experimental results on kernels from multimedia applications demonstrate that multithreading support can improve the total throughput of a CGRA by over 30%, 75%, and 150% on 4x4, 6x6, and 8x8 CGRAs, respectively, compared to single-threaded methods.

I. INTRODUCTION

Power efficiency has become one of the most important design metrics in many computational domains. In high performance computing, performance is critically constrained by power and thermal factors such that greater performance is only achievable by increasing power efficiency.
In addition, power efficiency is arguably the most important metric in determining the usability of consumer electronic devices, such as cell phones, music players, and tablets. Here, power efficiency directly translates into system weight and volume (since the battery accounts for most of the system's weight and volume), recharge time, and processing frequency of the device.

Coarse-Grained Reconfigurable Arrays or CGRAs are a promising solution for power efficient computation. A CGRA is a grid of very efficient processors, each typically nothing more than an Arithmetic Logic Unit (ALU) and a small register file (RF). Computation is statically mapped onto the CGRA during compilation. Very little power is expended in performing an operation, and therefore CGRAs are very power efficient. CGRAs have been shown to achieve power efficiencies of 10-100 GOps/W [1]. This is about 2 orders of magnitude higher than the Intel Core i7 (quad core) processor, which has a peak performance of 45 GOps/s, but consumes 130 W of power, providing a power efficiency of 0.347 GOps/W [2]. Several implementations of CGRAs exist, such as MorphoSys [1], ADRES [3], RSPA [4], and KressArray [5]. A comprehensive summary of many of them appears in [6].

Initially, CGRAs were used for fast and power efficient processing of streaming applications in multimedia, signal processing, and networking domains. These extremely-embedded systems had a small set of applications, with deterministic computation needs, allowing CGRAs to be programmed by hand. However, as the need for power efficiency grows in all computing domains, researchers have started to conceive the use of CGRAs as more general-purpose accelerators. Here, the CGRA would be a tightly-coupled accelerator to a processor, with the ability to accelerate many more application kernels than are present in extremely-embedded systems.
To automate this process, considerable research since the turn of this century has gone into compiler techniques that map a given loop kernel onto a CGRA, e.g., [7], [8].

As an accelerator to a processor, a CGRA can only accelerate one kernel of one thread running on the processor at any given point in time. This is because CGRAs are completely statically scheduled, while thread start and end times are extremely dynamic in nature. CGRA compilers typically map the loop kernel to the entire CGRA, preventing any other thread from using the CGRA. Support for multithreading in CGRAs will not only increase CGRA resource utilization, and therefore throughput, but also improve the performance of communicating tasks and help alleviate memory bottlenecks.

A key requirement for multithreading is the ability to restrict a given kernel to use only a portion of the CGRA. However, at compile time, the compiler maps the kernel onto the entire CGRA. Multithreading therefore requires the ability to shrink an existing schedule at runtime so that it uses less of the CGRA. The multithreading mechanism on the processor can then shrink and expand the schedules dynamically as threads are invoked and finish. One challenge is that the schedule transformation problem is equivalent to the original kernel mapping problem, which is difficult: existing CGRA compilers rely on techniques like simulated annealing [9], and their compilation times are quite long. Traditionally, compile time has not been a concern, as applications are compiled only once and run indefinitely. However, to support multithreading, the schedule transformation algorithm must be fast, since it will be used at runtime.

2011 International Conference on Parallel Processing, 0190-3918/11 $26.00 © 2011 IEEE, DOI 10.1109/ICPP.2011.77

In this paper, we propose an application mapping and dynamic transformation scheme that enables multithreading capabilities on a CGRA structure. The key idea in this