An FPGA-based Heterogeneous Coarse-Grained Dynamically Reconfigurable Architecture

Ricardo Ferreira
Departamento de Informatica, Universidade Federal de Vicosa, Vicosa, Brazil
ricardo@ufv.br

Julio Goldner Vendramini
Departamento de Informatica, Universidade Federal de Vicosa, Vicosa, Brazil
julio.vendramini@ufv.br

Lucas Mucida
Departamento de Informatica, Universidade Federal de Vicosa, Vicosa, Brazil
lucas.mucida@ufv.br

Monica M. Pereira
Instituto de Informatica-PPGC, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil
mmpereira@inf.ufrgs.br

Luigi Carro
Instituto de Informatica-PPGC, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil
carro@inf.ufrgs.br

ABSTRACT

Coarse-grained reconfigurable architectures (CGRAs) have emerged as a promising model for embedded systems, as a solution to reduce the complexity of the FPGA synthesis and mapping steps and, consequently, the reconfiguration time. Despite these advantages, CGRA usage has been limited by the lack of commercial CGRA circuits. This work proposes a virtual and dynamic CGRA implemented on top of an FPGA. This approach allows the use of commercial-off-the-shelf FPGA devices combined with the advantages of CGRAs. The proposed architecture consists of a set of heterogeneous functional units (FUs) and a global interconnection network. The global network allows any FU to be used at each cycle, which significantly reduces the placement complexity. In addition, we introduce a polynomial mapping algorithm that includes the scheduling, placement and routing (SPR) steps. Moreover, the proposed approach performs placement and routing very quickly in comparison to similar CGRA approaches: the three SPR steps are computed in a few milliseconds. The feasibility of this approach is demonstrated for a suite of digital signal processing benchmarks.
Categories and Subject Descriptors

C.1.3 [Other Architecture Styles]: Adaptable architectures, Data-flow architectures, Heterogeneous systems

General Terms

Performance

Keywords

Reconfigurable Architectures, CGRA, FPGA, Placement, Routing, Scheduling, Interconnections, Multistage

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CASES’11, October 9–14, 2011, Taipei, Taiwan.
Copyright 2011 ACM 978-1-4503-0713-0/11/10 ...$10.00.

1. INTRODUCTION

As scaling continuously increases circuit densities, increasing the amount of resources is no longer a challenge in terms of available area and cost [5]. As a consequence, many solutions have been proposed to take advantage of the abundant resources and increase device efficiency. Spatial computing emerges as a solution for increasing performance by distributing computations in space rather than in time [8]. In fact, increasing the number of parallel processing elements allows concurrent operation and consequently accelerates computation.

Reconfigurable computing emerges as an alternative that reduces the time-to-market and, at the same time, adds flexibility and fast prototyping for spatial and/or temporal computing [16, 10, 7]. Current FPGA devices provide flexibility through a large number of fine-grained reconfigurable units and interconnection elements. However, one of the main challenges is mapping generic applications onto these complex FPGA devices. The problem is NP-complete, and current synthesis and mapping tools consume considerable CPU time [16, 25].
In fact, performing placement and routing can take minutes, hours or, in the worst cases, days. This bottleneck has been one of the main obstacles to the widespread use of FPGAs.

Coarse-grained reconfigurable architectures (CGRAs) are reconfigurable at word level (16 bits, 32 bits, etc.), while FPGAs are reconfigurable at bit level. The direct consequence of working at word level is a reduction in the number of configuration bits, in the configuration time, and in the placement and routing complexity [12]. However, even for CGRAs, placement and routing is an NP-complete problem for spatial computation. In addition, when temporal computing is considered, scheduling is also an NP-complete problem.

Despite the advantages of CGRAs, there is a lack of compiler tools and of commercial devices. Most tools are specific to a subset of applications and to a specific architecture. Therefore, since few commercial CGRA devices are available [7], an alternative is to implement CGRAs as virtual devices on top of commercial-off-