Accelerator-Rich CMPs: From Concept to Real Hardware

Yu-Ting Chen, Jason Cong*, Mohammad Ali Ghodrat, Muhuan Huang, Chunyue Liu, Bingjun Xiao*, Yi Zou
Computer Science Department, University of California, Los Angeles, Los Angeles, California, 90095
*{cong, xiao}@cs.ucla.edu

Abstract—Application-specific accelerators provide 10-100x improvement in power efficiency over general-purpose processors, and accelerator-rich architectures are especially promising. This work discusses a prototype of accelerator-rich CMPs (PARC). During our development of PARC in real hardware, we encountered a set of technical challenges and proposed corresponding solutions. First, we provided system IPs that serve a sea of accelerators by transferring data between userspace and accelerator memories without cache overhead. Second, we designed a dedicated interconnect between accelerators and memories to enable memory sharing. Third, we implemented an accelerator manager to virtualize accelerator resources for users. Finally, we developed an automated flow with a number of IP templates and customizable interfaces to a C-based synthesis flow to enable rapid design and update of PARC. We implemented PARC in a Virtex-6 FPGA chip with integration of platform-specific peripherals and booting of unmodified Linux. Experimental results show that PARC can fully exploit the energy benefits of accelerators at little system overhead.

Keywords—customizable computing, computer architecture, prototyping, FPGA, design automation.

I. INTRODUCTION

Accelerator-rich architectures can bring 10-100x energy efficiency by offloading computation from general-purpose CPU cores to application-specific accelerators [1-6]. On-chip accelerators fall into two classes. While tightly coupled accelerators [1, 2] are constrained within a CPU's pipeline and thus see limited benefit from full customization, loosely coupled accelerators [3-6] completely bypass CPU overheads (instructions, caches, etc.)
and can be optimized in a larger design space. Prior work [3-6] proposed a methodology for integrating a sea of loosely coupled accelerators alongside very few CPU cores to build an energy-efficient computing system. This methodology does not expect all accelerators to be in use at all times, but it guarantees that each computation task is executed by the most efficient hardware.

Methodologies for integrating large numbers of CPU cores in a system are well established [7]. The related research mainly addresses three key issues: data transfer between userspace and device memories, on-chip memory architecture, and hardware resource management. These issues also arise in the integration of massive numbers of accelerators. However, very little research has examined what changes should be made to these solutions when we switch from CPU-centric to accelerator-centric architectures.

This work undertakes an implementation study of a general framework for accelerator-rich CMPs in an FPGA-based prototype, which we name PARC. By realizing PARC in RTL and running it in real hardware, we found that many architectural assumptions and design choices used in prior research [4-6] work only for CPU cores and lose effectiveness when facing a sea of accelerators. We propose our own solutions, including hardware innovations and software automation, to meet the demands of accelerators:

1) Shared system IPs for accelerators to transfer data between userspace and device memories without cache overhead.
2) A dedicated interconnect between accelerators and memories to enable memory sharing.
3) An extensible accelerator manager on a standalone bare-metal processor to perform runtime scheduling of accelerator resources.
In addition, we find that due to the large scale and high heterogeneity of accelerator-rich architectures, design cycles will increase significantly if we follow conventional design methodologies. In our prototyping, we developed an automated flow to enable rapid development of PARC:

1) For accelerator designers, we integrate high-level synthesis (HLS) tools into our development flow so that accelerators can be designed at a high level of abstraction (ANSI C), along with a standardized accelerator interface in HLS-compatible C.
2) For application programmers, we virtualize physical accelerator resources into accelerator classes and provide object-oriented APIs, so that programmers can use different accelerators by calling member functions of different accelerator objects.
3) For system developers, we create a fully automated flow of system synthesis and generation from a high-level system description file.

PARC is verified in a commodity FPGA chip with the goal of providing guidelines for future ASIC implementation. We report experimental results from running the real hardware with unmodified Linux, and demonstrate that PARC can fully exploit the energy benefits of accelerators at little system overhead.

II. RELATED WORK

A good example of accelerator-rich CMPs is the Wire Speed Processor, which has four accelerators (XML, Regex, Comp and Crypto) shared by several CPU cores [4]. A general framework for accelerator-rich CMPs (ARC), as proposed in [3], is shown in Fig. 1. ARC presented a hardware resource management scheme for accelerator sharing, scheduling, and virtualization. This scheme introduced a global accelerator manager (GAM) implemented in hardware to support sharing and arbitration of a common set of accelerators among multiple cores. It also proposed several new custom instructions for communicating with the GAM to avoid OS overhead in accelerator interaction.
The on-chip memory architecture of accelerator-rich CMPs is another research focus. The necessity of on-chip memory sharing among accelerators was reported in [5]. Later, a more complete

978-1-4799-2987-0/13/$31.00 ©2013 IEEE