Accelerator-Rich CMPs: From Concept to Real Hardware
Yu-Ting Chen, Jason Cong*, Mohammad Ali Ghodrat, Muhuan Huang, Chunyue Liu, Bingjun Xiao*, Yi Zou
Computer Science Department
University of California, Los Angeles
Los Angeles, California 90095
*{cong, xiao}@cs.ucla.edu
Abstract—Application-specific accelerators provide 10-100x improvement in power efficiency over general-purpose processors. Among them, accelerator-rich architectures are especially promising. This work
discusses a prototype of accelerator-rich CMPs (PARC). During our
development of PARC in real hardware, we encountered a set of
technical challenges and proposed corresponding solutions. First, we
provided system IPs that serve a sea of accelerators to transfer data
between userspace and accelerator memories without cache overhead.
Second, we designed a dedicated interconnect between accelerators
and memories to enable memory sharing. Third, we implemented
an accelerator manager to virtualize accelerator resources for users.
Finally, we developed an automated flow with a number of IP templates
and customizable interfaces to a C-based synthesis flow to enable rapid
design and update of PARC. We implemented PARC on a Virtex-6 FPGA chip, with integration of platform-specific peripherals and booting of unmodified Linux. Experimental results show that PARC can fully exploit the energy benefits of accelerators with little system overhead.
Keywords-customizable computing, computer architecture, prototyp-
ing, FPGA, design automation.
I. INTRODUCTION
Accelerator-rich architectures can deliver 10-100x energy-efficiency improvements by offloading computation from general-purpose CPU cores to application-specific accelerators [1–6]. On-chip accelerators can be categorized into two classes. While tightly coupled accelerators
[1, 2] are constrained within a CPU’s pipeline and thus gain limited benefit from full customization, loosely coupled accelerators
[3–6] completely bypass CPU overheads (instructions, caches, etc.)
and can be optimized in a larger design space. Prior work [3–6]
proposed a methodology for integrating a sea of loosely coupled
accelerators along with very few CPU cores to build up an energy-
efficient computing system. This methodology does not expect all
the accelerators to be used all the time, but it guarantees that each
computation task is executed by the most efficient hardware.
Methodologies for integrating massive numbers of CPU cores in a system are well established [7]. The related research mainly focuses on three key issues: data transfer between userspace and device memories, on-chip memory architecture, and hardware resource management. These issues also arise in the integration of massive numbers of accelerators. However, very little research has examined how the solutions should change when we move from CPU-centric architectures to accelerator-centric architectures.
This work undertakes an implementation study of a general
framework for accelerator-rich CMPs in an FPGA-based prototype.
We name our prototype of accelerator-rich CMPs PARC. By
realizing this in RTL and running it in real hardware, we find that
many architectural assumptions and design choices used in prior research [4–6] work only for CPU cores and lose effectiveness
when facing a sea of accelerators. We propose our own solutions,
including hardware innovation and software automation, to meet
the demand of accelerators. These solutions are:
1) Shared system IPs for accelerators to transfer data between
userspace and device memories without cache overhead.
2) A dedicated interconnect between accelerators and memories
to enable memory sharing.
3) An extensible accelerator manager in a standalone bare-
metal processor to perform runtime scheduling of accelerator
resources.
In addition, we find that due to the large scale and high het-
erogeneity of accelerator-rich architectures, their design cycles
will significantly increase if we still follow conventional design
methodologies. In our prototyping, we develop an automated flow
to enable rapid development of PARC:
1) For accelerator designers, we integrate high-level synthesis
(HLS) tools in our development flow to allow accelerators to
be designed in a high-level abstraction (ANSI C), along with
a standardized accelerator interface in HLS-compatible C.
2) For application programmers, we virtualize physical accelerator resources into accelerator classes, and provide object-oriented APIs so that programmers can use different accelerators by calling member functions of different accelerator objects.
3) For system developers, we create a fully automated flow of
system synthesis and generation from a high-level system
description file.
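To make the first point concrete, an accelerator kernel written for HLS is typically a fixed-signature ANSI C function operating on plain, statically sized arrays, so the synthesis tool can map the arrays to accelerator-local memories. The kernel and interface below are a hypothetical example of this style, not PARC's actual interface template.

```c
/* Hypothetical HLS-compatible accelerator kernel in ANSI C: a fixed
 * signature over statically sized arrays, with no dynamic allocation,
 * recursion, or pointer chasing. Invented for illustration; PARC's real
 * standardized interface is not reproduced here. */
#define N 8

/* Interface in a standardized style: input buffer, output buffer, and a
 * scalar parameter. */
void acc_vec_scale(const int in[N], int out[N], int scale)
{
    /* A simple, bounded loop is exactly what HLS tools can pipeline. */
    for (int i = 0; i < N; i++)
        out[i] = in[i] * scale;
}
```

Because the function is plain C, the same source can be compiled natively for functional testing and then fed unchanged to the HLS flow for RTL generation.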
Our PARC is verified in a commodity FPGA chip with the
goal of providing guidelines for future ASIC implementation. We
report experimental results from running the real hardware with an unmodified Linux. We demonstrate that PARC can fully exploit the energy benefits of accelerators with little system overhead.
II. RELATED WORK
A good example of accelerator-rich CMPs is the Wire Speed Processor, which has four accelerators (XML, Regex, Comp, and Crypto) shared by several CPU cores [4]. A general framework
for accelerator-rich CMPs (ARC) as proposed in [3] is shown in
Fig. 1. ARC presented a hardware resource management scheme
for accelerator sharing, scheduling, and virtualization. This scheme
introduced a global accelerator manager (GAM) implemented in
hardware to support sharing and arbitration of multiple cores for
a common set of accelerators. It also proposed to use several new
custom instructions for communicating with the GAM to avoid OS
overhead in accelerator interaction.
The on-chip memory architecture of accelerator-rich CMPs is
another research focus. The necessity of on-chip memory sharing
among accelerators was reported in [5]. Later, a more complete
978-1-4799-2987-0/13/$31.00 ©2013 IEEE 169