Reducing Control Power in CGRAs with Token Flow

Hyunchul Park, Yongjun Park, and Scott Mahlke
Advanced Computer Architecture Laboratory, University of Michigan
Ann Arbor, MI, USA
{parkhc, yjunpark, mahlke}@umich.edu

Figure 1: CGRA overview: 4x4 array of PEs (left), a detailed view of a PE (right), and a PE instruction (bottom)

1. INTRODUCTION

Today’s mobile applications are multimedia rich, involving significant amounts of audio and video coding, 3D graphics, signal processing, and communications. These multimedia applications usually have a large number of kernels in which most of the execution time is spent. Traditionally, these compute-intensive kernels were accelerated by application-specific hardware in the form of ASICs to meet the competing demands of high performance and energy efficiency. However, the increasing convergence of different functionalities, combined with the high non-recurring costs involved in designing ASICs, has pushed designers towards programmable solutions.

Coarse-grained reconfigurable architectures (CGRAs) are becoming attractive alternatives because they offer large raw computation capabilities with low cost/energy implementations. CGRAs generally consist of an array of a large number of function units (FUs) interconnected by a mesh-style network (Figure 1). Register files are distributed throughout the CGRA to hold temporary values and are accessible only by a small subset of FUs. The FUs can execute common word-level operations, including addition, subtraction, and multiplication.

A major bottleneck for deploying CGRAs into a wider domain of embedded devices lies in the control path. The appealing features in the datapath of CGRAs ironically come back as a major overhead in the control path.
The distributed interconnect and register files require a large number of configuration bits to route values across the network, and the abundance of computation resources further adds to the configuration load on the control path. As a result, the total number of control bits to configure the whole array can reach nearly 1000 bits each cycle, and the control path takes up to 43% of the total power consumption in existing CGRA designs [2, 1]. Moreover, control bits are read from the on-chip memory every cycle regardless of the array’s utilization. To our knowledge, no previous work has addressed a general solution for power-efficient control path design in tiled accelerators like CGRAs. In this paper, we propose a new control path design that improves the code efficiency of CGRAs by leveraging token networks originally proposed for dataflow machines.

2. MOTIVATION

Conventionally, code compression is performed at the instruction level with no-op compression or a variable-length encoding. No-op compression is widely used in VLIW processors and many DSPs [4]. However, instruction-level compression does not work well in CGRAs due to the highly distributed nature of the resources. We discovered that only 17% of PE instructions are pure no-ops (none of the components in the PE are active), while the average utilization of FUs is 55%.

Figure 2: Different Control Path Designs: (a) No compression, (b) Fine-grain code compression with static instruction format, (c) Fine-grain code compression with a token network (F and R indicate FU token module and RF token module, respectively)
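The gap between these two numbers is why instruction-level no-op compression recovers so little in a CGRA: a whole PE instruction can be dropped only when every component in the PE is idle, while individual fields are idle far more often. A minimal sketch of the contrast, using hypothetical activity data (purely illustrative, not the paper’s measurements):

```python
# Toy contrast of instruction-level no-op compression vs. fine-grain
# (field-level) compression. The activity data below is hypothetical.

pes = [
    # each PE instruction: field name -> is the field active this cycle?
    {"opcode": True,  "src0": True,  "src1": False, "route": True},
    {"opcode": False, "src0": False, "src1": False, "route": False},  # pure no-op
    {"opcode": False, "src0": False, "src1": False, "route": True},   # routing only
    {"opcode": True,  "src0": True,  "src1": True,  "route": True},
]

def noop_compressible(pes):
    """PE instructions removable by instruction-level no-op compression:
    every field in the PE must be inactive."""
    return sum(1 for pe in pes if not any(pe.values()))

def removable_fields(pes):
    """Individual fields removable by fine-grain (field-level) compression."""
    return sum(1 for pe in pes for active in pe.values() if not active)

print(noop_compressible(pes))  # only 1 of 4 instructions is a pure no-op
print(removable_fields(pes))   # but 8 of 16 fields carry no valid data
```

Fine-grain compression targets the second count, which motivates compressing at field granularity rather than instruction granularity.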
However, there is a good opportunity for fine-grain code compression: compressing instruction fields (e.g., opcode, MUX selection, register address) rather than the whole instruction. On average, only 35% of all instruction fields contain valid data, so efficiency can potentially be increased by removing unused fields.

Figure 2(b) shows a high-level organization that utilizes a static fine-grain compression approach. In the simplest variant, a presence bit is added for each field to indicate whether the field exists or not. The instruction encoding consists of the presence bits (the instruction format) followed by the subset of valid instruction fields concatenated together. With this approach, decoding can become complex due to the variable-length nature of the encoding, but all unused fields can be removed in principle.

The biggest challenge for applying static fine-grain compression lies in the instruction formats. Using a simple fine-grain static compression scheme that we designed for a CGRA, the code efficiency increases by 24%, with the average number of instruction bits decreasing from 845 to 647. However, 172 of the 647 bits are used for encoding the instruction formats. Since the 172-bit instruction format needs to be read from the configuration memory every cycle regardless of the number of fields present, the instruction format itself becomes a significant overhead in the control path. To address this limitation, we propose to dynamically discover the instruction formats by applying a dataflow token network, explained in the next section.

3. TOKEN NETWORK

3.1 Concepts

The basic idea of dynamic instruction format discovery is that resources need configurations only when there is useful data flowing through them. By looking at the locations of data coming into a PE, we can infer the instruction format of the current instruction.
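The inference step just described can be sketched as follows. Token names, field widths, and the token-to-field mapping below are illustrative assumptions, not the actual hardware design: the set of inputs carrying data implies which instruction fields exist, and only those fields are then unpacked from configuration memory.

```python
# Hedged sketch of dynamic instruction format discovery: the tokens that
# arrive at a PE one cycle early imply the instruction format, so only
# the present fields are fetched. All names/widths here are assumptions.

FIELD_WIDTHS = {"opcode": 4, "src0_mux": 3, "src1_mux": 3}

# fields implied by each incoming token (assumed mapping)
TOKEN_TO_FIELDS = {
    "src0": ("opcode", "src0_mux"),  # data on src0 implies an FU operation
    "src1": ("src1_mux",),
}

def infer_format(tokens):
    """Derive the instruction format (an ordered field list) from the
    set of tokens that arrived one cycle ahead of the data."""
    fields = []
    for tok in sorted(tokens):
        for f in TOKEN_TO_FIELDS[tok]:
            if f not in fields:
                fields.append(f)
    return fields

def fetch_fields(packed, fields):
    """Unpack only the present fields from a packed configuration word,
    assuming they are stored first-to-last in `fields` order."""
    remaining = sum(FIELD_WIDTHS[f] for f in fields)
    decoded = {}
    for f in fields:
        remaining -= FIELD_WIDTHS[f]
        decoded[f] = (packed >> remaining) & ((1 << FIELD_WIDTHS[f]) - 1)
    return decoded

fmt = infer_format({"src0", "src1"})
print(fmt)                      # ['opcode', 'src0_mux', 'src1_mux']
print(fetch_fields(342, fmt))   # opcode=5, src0_mux=2, src1_mux=6
```

Because every absent field is simply never fetched, no per-cycle format bits need to be stored, which is exactly the overhead that the static scheme could not avoid.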
We can utilize a token network from dataflow machines [3] to provide information on where data flows in the distributed network. A token is sent from a producer to its consumers one cycle ahead of the actual data. Originally, a consumer fired when it had accumulated sufficient tokens. However, this concept can be altered here because all tokens for a single instruction are guaranteed to arrive at the same time. Hence, the set of tokens uniquely determines the instruction format, so that the necessary fields can be fetched from the instruction memory. When the actual data arrives