Enabling Multithreading on CGRAs
†Aviral Shrivastava, †Jared Pager, †Reiley Jeyapaul, †♯Mahdi Hamzeh, and ♯Sarma Vrudhula
†Compiler Microarchitecture Lab, ♯VLSI Electronic Design Automation Laboratory
Arizona State University, Tempe, AZ, USA
Email: {aviral.shrivastava, jppager, reiley.jeyapaul, mahdi.hamzeh, vrudhula}@asu.edu
Abstract—Coarse-Grained Reconfigurable Arrays or CGRAs
are programmable fabrics that promise both high performance
and high power efficiency. Traditionally, CGRAs were used
to accelerate extremely-embedded systems, and were typically
manually programmed. However, as CGRAs are conceived to be
used as more general-purpose accelerators, there is a need to
develop software tools and capabilities. Much work has been
done on developing compiler techniques for CGRAs, making
programming them easier; however, there is no support for
multithreading. As an accelerator to a multithreaded processor,
a CGRA is currently restricted to accelerating only one kernel of
one thread running on the processor at any point in time.
Supporting multithreading is difficult, since the start times and
end times of threads are dynamic in nature, while CGRAs are
statically scheduled. In this paper, we propose a strategy to support
multithreading on a CGRA. The chief capability that we develop
is a scheme to quickly transform an existing application mapping
using the entire CGRA to one using only a fraction of it. Our
experimental results on kernels from multimedia applications
demonstrate that multithreading support can improve the total
throughput of a CGRA by over 30%, 75%, and 150% on 4x4,
6x6, and 8x8 CGRAs, respectively, compared to single-threaded
methods.
I. INTRODUCTION
Power efficiency has become one of the most important
design metrics in many computational domains. In high per-
formance computing, performance is critically constrained by
power and thermal factors such that greater performance is
only achievable by increasing power efficiency. In addition,
power efficiency is arguably the most important metric in
determining the usability of consumer electronic devices,
such as cell phones, music players, tablets, etc. Here, power
efficiency directly translates into system weight and volume
(since the battery accounts for most of the system's weight and
volume), recharge time, and processing frequency of the device.
Coarse-Grained Reconfigurable Arrays or CGRAs are a
promising solution for power efficient computation. A CGRA
is a grid of very efficient processors, typically nothing more
than an Arithmetic Logic Unit (ALU) and a small register file
(RF). Computation is statically mapped out on the CGRA dur-
ing compilation. Very little power is expended in performing
an operation and therefore CGRAs are very power efficient.
CGRAs have been shown to achieve power efficiencies of
10-100 GOps/W [1]. This is about 2 orders of magnitude
higher than the Intel Core i7 (quad core) processor, which
has a peak performance of 45 GOps/s, but consumes 130 W
of power, providing a power efficiency of about 0.35 GOps/W [2].
Several implementations of CGRAs such as MorphoSys [1],
ADRES [3], RSPA [4], and KressArray [5] exist. [6] contains
a comprehensive summary of many of them.
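The efficiency comparison above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text (45 GOps/s and 130 W for the Core i7, 10-100 GOps/W for CGRAs); the variable names are illustrative. The quotient 45/130 evaluates to roughly 0.346 GOps/W, and the resulting CGRA advantage spans roughly one and a half to two and a half orders of magnitude.

```python
# Figures quoted in the text for the Intel Core i7 (quad core) [2].
peak_gops = 45.0      # peak performance, giga-operations per second
power_watts = 130.0   # power consumption, watts

cpu_efficiency = peak_gops / power_watts  # GOps/W

# CGRA power-efficiency range quoted in the text [1].
cgra_low, cgra_high = 10.0, 100.0  # GOps/W

# How many times more efficient a CGRA is at each end of the range.
ratio_low = cgra_low / cpu_efficiency
ratio_high = cgra_high / cpu_efficiency

print(f"CPU efficiency: {cpu_efficiency:.3f} GOps/W")
print(f"CGRA advantage: {ratio_low:.0f}x to {ratio_high:.0f}x")
```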
Initially, CGRAs were used for fast and power efficient
processing of streaming applications in multimedia, signal pro-
cessing, and networking domains. These extremely-embedded
systems had a small set of applications, with deterministic
computation needs, allowing CGRAs to be programmed by
hand. However, as the need for power efficiency grows in all
computing domains, researchers have started to conceive the
use of CGRAs as more general-purpose accelerators. Here, the
CGRA would be a tightly-coupled accelerator to a processor,
able to accelerate far more application kernels than are
present in extremely-embedded systems. To automate the mapping
process, much research since the turn of this century has gone
into compiler techniques that map a given loop kernel onto a
CGRA, e.g., [7], [8].
As an accelerator to a processor, a CGRA can only acceler-
ate one kernel of one thread running on the processor at any
given point in time. This is because CGRAs are completely
statically scheduled, while thread start and end times are
extremely dynamic in nature. CGRA compilers typically map
the loop kernel to the entire CGRA, preventing any other
thread from using the CGRA. Support for multithreading
in CGRAs would not only increase CGRA resource utilization,
and therefore throughput, but also improve the performance of
communicating tasks and help alleviate memory bottlenecks.
A key requirement for multithreading is the ability to restrict
a given kernel to use only a portion of the CGRA. However,
the compiler maps each kernel using the entire CGRA.
Multithreading therefore requires the ability to shrink an existing
schedule to use less of the CGRA at runtime. The multithreading
mechanism on the processor can then shrink and expand the
schedules dynamically as threads are invoked and finish.
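To make the idea of shrinking a schedule concrete, the following is a minimal sketch of one generic way to do it: folding several processing elements (PEs) onto one physical PE by time-multiplexing. It is an illustration only, not the transformation proposed in this paper. The schedule representation, the function names `shrink_schedule` and `has_conflict`, and the toy kernel are all hypothetical, and the sketch assumes that interconnect and register-file constraints can be ignored, which a real CGRA mapping cannot do.

```python
from typing import Dict, Tuple

# Hypothetical representation (an assumption, not the paper's format):
# op name -> (pe_index, time_step) in a modulo schedule that repeats
# every `ii` cycles on `num_pes` PEs.
Schedule = Dict[str, Tuple[int, int]]


def shrink_schedule(schedule: Schedule, num_pes: int, ii: int, factor: int):
    """Fold a schedule onto num_pes // factor PEs by time-multiplexing.

    Each original cycle is stretched into `factor` cycles, and the PEs
    folded onto the same physical PE take turns within that stretched
    cycle. Interconnect and register constraints are ignored (assumption).
    """
    if factor < 1 or num_pes % factor != 0:
        raise ValueError("factor must be a positive divisor of num_pes")
    small = num_pes // factor
    folded = {
        op: (pe % small, t * factor + pe // small)
        for op, (pe, t) in schedule.items()
    }
    return folded, small, ii * factor


def has_conflict(schedule: Schedule, ii: int) -> bool:
    """Return True if two ops share a PE in the same slot modulo ii."""
    slots = [(pe, t % ii) for pe, t in schedule.values()]
    return len(slots) != len(set(slots))


# Toy kernel on 4 PEs with ii = 2: c = a + b; d = c * e
full = {
    "load_a": (0, 0),
    "load_b": (1, 0),
    "load_e": (2, 0),
    "add_c": (0, 1),
    "mul_d": (3, 2),
}
half, pes, new_ii = shrink_schedule(full, num_pes=4, ii=2, factor=2)
print(half, pes, new_ii)
```

In this sketch the repeat interval grows by the shrink factor, so a single kernel runs more slowly on its reduced share; any overall benefit would have to come from a second thread using the PEs that were freed.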
One challenge is that the schedule transformation problem
is equivalent to the original kernel mapping problem, which
is difficult: existing CGRA compilers use techniques like
simulated annealing [9], and their compilation times are quite
long. Traditionally, compile time has not been a concern, as
applications are compiled only once and run indefinitely.
However, to support multithreading, the schedule transformation
algorithm must be fast, since it will be used at runtime.
In this paper, we propose an application mapping and
dynamic transformation scheme that enables multithreading
capabilities on a CGRA structure. The key idea in this
2011 International Conference on Parallel Processing
0190-3918/11 $26.00 © 2011 IEEE
DOI 10.1109/ICPP.2011.77