Reducing Control Power in CGRAs with Token Flow

Hyunchul Park, Yongjun Park, and Scott Mahlke
Advanced Computer Architecture Laboratory, University of Michigan
Ann Arbor, MI, USA
{parkhc, yjunpark, mahlke}@umich.edu

Figure 1: CGRA overview: 4x4 array of PEs (left), a detailed view of a PE (right), and a PE instruction (bottom)

1. INTRODUCTION

Today’s mobile applications are multimedia rich, involving significant amounts of audio and video coding, 3D graphics, signal processing, and communications. These multimedia applications usually have a large number of kernels in which most of the execution time is spent. Traditionally, these compute-intensive kernels were accelerated by application-specific hardware in the form of ASICs to meet the competing demands of high performance and energy efficiency. However, the increasing convergence of different functionalities, combined with the high non-recurring costs involved in designing ASICs, has pushed designers towards programmable solutions.

Coarse-grained reconfigurable architectures (CGRAs) are becoming attractive alternatives because they offer large raw computation capabilities with low cost/energy implementations. CGRAs generally consist of an array of a large number of function units (FUs) interconnected by a mesh-style network (Figure 1). Register files are distributed throughout the CGRA to hold temporary values and are accessible only by a small subset of FUs. The FUs can execute common word-level operations, including addition, subtraction, and multiplication.

A major bottleneck for deploying CGRAs into a wider domain of embedded devices lies in the control path. The appealing features in the datapath of CGRAs ironically come back as a major overhead in the control path.
The distributed interconnect and register files require a large number of configuration bits to route values across the network, and the abundance of computation resources further adds to the configuration load on the control path. As a result, the total number of control bits to configure the whole array can reach nearly 1000 bits each cycle, and the control path takes up to 43% of the total power consumption in existing CGRA designs [2, 1]. Moreover, control bits are read from the on-chip memory every cycle regardless of the array’s utilization. To our knowledge, no previous work has addressed a general solution for power-efficient control path design in tiled accelerators like CGRAs. In this paper, we propose a new control path design that improves the code efficiency of CGRAs by leveraging token networks originally proposed for dataflow machines.

2. MOTIVATION

Conventionally, code compression is performed at the instruction level with no-op compression or a variable-length encoding. No-op compression is widely used in VLIW processors and many DSPs [4]. However, instruction-level compression does not work well in CGRAs due to the highly distributed nature of the resources. We discovered that only 17% of PE instructions are pure no-ops (none of the components in the PE are active), while the average utilization of FUs is 55%.

Figure 2: Different Control Path Designs: (a) No compression, (b) Fine-grain code compression with static instruction format, (c) Fine-grain code compression with a token network (F and R indicate FU token module and RF token module, respectively)
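The gap between these two numbers is why instruction-level no-op compression recovers so little in a CGRA: a whole PE instruction can be dropped only when every component in the PE is idle, while individual fields are idle far more often. A minimal sketch of the contrast, using hypothetical activity data (purely illustrative, not the paper’s measurements):

```python
# Toy contrast of instruction-level no-op compression vs. fine-grain
# (field-level) compression. The activity data below is hypothetical.

pes = [
    # each PE instruction: field name -> is the field active this cycle?
    {"opcode": True,  "src0": True,  "src1": False, "route": True},
    {"opcode": False, "src0": False, "src1": False, "route": False},  # pure no-op
    {"opcode": False, "src0": False, "src1": False, "route": True},   # routing only
    {"opcode": True,  "src0": True,  "src1": True,  "route": True},
]

def noop_compressible(pes):
    """PE instructions removable by instruction-level no-op compression:
    every field in the PE must be inactive."""
    return sum(1 for pe in pes if not any(pe.values()))

def removable_fields(pes):
    """Individual fields removable by fine-grain (field-level) compression."""
    return sum(1 for pe in pes for active in pe.values() if not active)

print(noop_compressible(pes))  # only 1 of 4 instructions is a pure no-op
print(removable_fields(pes))   # but 8 of 16 fields carry no valid data
```

Fine-grain compression targets the second count, which motivates compressing at field granularity rather than instruction granularity.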
However, there is a good opportunity for fine-grain code compression: compressing instruction fields (e.g., opcode, MUX selection, register address) rather than the whole instruction. On average, only 35% of all instruction fields contain valid data, so efficiency can potentially be increased by removing unused fields.

Figure 2(b) shows a high-level organization that utilizes a static fine-grain compression approach. In the simplest variant, a presence bit is added for each field to indicate whether the field exists or not. The instruction encoding consists of the presence bits (the instruction format) followed by the subset of valid instruction fields concatenated together. With this approach, decoding can become complex due to the variable-length nature of the encoding, but all unused fields can be removed in principle.

The biggest challenge for applying static fine-grain compression lies in the instruction formats. Using a simple fine-grain static compression scheme that we designed for a CGRA, the code efficiency increases by 24%, with the average number of instruction bits decreasing from 845 to 647. However, 172 of the 647 bits are used for encoding the instruction formats. Since the 172-bit instruction format needs to be read from the configuration memory every cycle regardless of the number of fields present, the instruction format itself becomes a significant overhead in the control path. To address this limitation, we propose to dynamically discover the instruction formats by applying a dataflow token network, explained in the next section.

3. TOKEN NETWORK

3.1 Concepts

The basic idea of dynamic instruction format discovery is that resources need configurations only when there is useful data flowing through them. By looking at the locations of data coming into a PE, we can infer the instruction format of the current instruction.
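The inference step just described can be sketched as follows. Token names, field widths, and the token-to-field mapping below are illustrative assumptions, not the actual hardware design: the set of inputs carrying data implies which instruction fields exist, and only those fields are then unpacked from configuration memory.

```python
# Hedged sketch of dynamic instruction format discovery: the tokens that
# arrive at a PE one cycle early imply the instruction format, so only
# the present fields are fetched. All names/widths here are assumptions.

FIELD_WIDTHS = {"opcode": 4, "src0_mux": 3, "src1_mux": 3}

# fields implied by each incoming token (assumed mapping)
TOKEN_TO_FIELDS = {
    "src0": ("opcode", "src0_mux"),  # data on src0 implies an FU operation
    "src1": ("src1_mux",),
}

def infer_format(tokens):
    """Derive the instruction format (an ordered field list) from the
    set of tokens that arrived one cycle ahead of the data."""
    fields = []
    for tok in sorted(tokens):
        for f in TOKEN_TO_FIELDS[tok]:
            if f not in fields:
                fields.append(f)
    return fields

def fetch_fields(packed, fields):
    """Unpack only the present fields from a packed configuration word,
    assuming they are stored first-to-last in `fields` order."""
    remaining = sum(FIELD_WIDTHS[f] for f in fields)
    decoded = {}
    for f in fields:
        remaining -= FIELD_WIDTHS[f]
        decoded[f] = (packed >> remaining) & ((1 << FIELD_WIDTHS[f]) - 1)
    return decoded

fmt = infer_format({"src0", "src1"})
print(fmt)                      # ['opcode', 'src0_mux', 'src1_mux']
print(fetch_fields(342, fmt))   # opcode=5, src0_mux=2, src1_mux=6
```

Because every absent field is simply never fetched, no per-cycle format bits need to be stored, which is exactly the overhead that the static scheme could not avoid.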
We can utilize a token network from dataflow machines [3] to provide information on where data flows in the distributed network. A token is sent from a producer to its consumers one cycle ahead of the actual data. Originally, a consumer fired when it had accumulated sufficient tokens. However, this concept can be altered here because all tokens for a single instruction are guaranteed to arrive at the same time. Hence, the set of tokens uniquely determines the instruction format, so that the necessary fields can be fetched from the instruction memory. When the actual data arrives