OpenMP Extensions for FPGA Accelerators

Daniel Cabrera (1,2), Xavier Martorell (1,2), Georgi Gaydadjiev (3), Eduard Ayguade (1,2), Daniel Jiménez-González (1,2)

(1) Barcelona Supercomputing Center, c/Jordi Girona 31, Torre Girona, E-08034 Barcelona, Spain
(2) Universitat Politecnica de Catalunya, c/Jordi Girona 1-3, Campus Nord-UPC, Modul C6, E-08034 Barcelona, Spain
{dcabrera, xavim, eduard, djimenez}@ac.upc.edu
(3) Delft University of Technology, Mekelweg 4, 2628 CD Delft, The Netherlands
g.n.gaydadjiev@its.tudelft.nl
Abstract—Reconfigurable computing is one of the paths to explore towards low-power supercomputing. However, programming these reconfigurable devices is not an easy task and still requires significant research and development effort to become truly productive. In addition, the use of these devices as accelerators in multicore, SMP and ccNUMA architectures adds a further level of programming complexity: the programmer must specify the offloading of tasks to the reconfigurable devices and their interoperability with current shared-memory programming paradigms such as OpenMP. This paper presents extensions to OpenMP 3.0 that address this second challenge, together with an implementation in a prototype runtime system. With these extensions the programmer can easily express the offloading of an already existing reconfigurable binary code (bitstream), hiding all the complexities related to device configuration, bitstream loading, and data arrangement and movement to the device memory. Our current prototype implementation targets SGI Altix systems with RASC blades (based on the Virtex-4 FPGA). We analyze the overheads introduced by this implementation and propose a hybrid host/device operational mode that hides some of these overheads, significantly improving application performance. A complete evaluation of the system is carried out with a matrix multiplication kernel, including an estimation considering different FPGA frequencies.
I. INTRODUCTION
The gigahertz race we were used to in the last decade has stopped due to power dissipation problems. The extra transistors available for new designs are no longer used to increase the complexity of superscalar, out-of-order or multithreaded architectures. Instead, the technological increase in transistor count is used to include more than one core in the same chip (homogeneous multicore) and/or to incorporate accelerators (heterogeneous multicore) well suited for certain application domains, such as GPU units [1] or the vector units in the Cell/B.E. [2]. For these accelerators, exploiting the available parallelism is not an easy task, and relies on the use of specific SDKs.
The use of specialized devices designed to compute a specific function (ASIC circuits) is another alternative that benefits a specific kind of application. For example, an ASIC that computes fast Fourier transforms can clearly eliminate the computation bottlenecks found in some bioinformatics applications. Field Programmable Gate Arrays (FPGAs) are accelerators whose specific functionality can be retargeted to different domains at runtime. However, efficiently programming these specific functionalities requires the use of low-level hardware description languages (HDLs), such as Verilog or VHDL, which general-purpose programmers are not used to.
The productive parallelization of applications for heterogeneous multicore architectures that include one or more of such accelerators requires programming models able to express the proper offloading of tasks and of the data needed to perform the computation. This is the purpose of this paper, and in particular, to present a proposal that extends OpenMP 3.0 tasking [3] to target heterogeneous architectures with FPGA-based accelerators. The OpenMP 3.0 task pragma fits nicely with the idea of using one or more FPGAs as accelerators. In this paper we assume that the bitstreams corresponding to the computations to be offloaded as tasks are either existing IP cores or are generated using other compilation tools. This may impose some restrictions on the behavior of the tasks to be offloaded, for example on the use of synchronization constructs.
In order to motivate our extensions to OpenMP 3.0 and their implementation in the runtime system, Figure 1 shows part of the code that is necessary to offload the execution of a matrix multiplication bitstream matmul_fpga to one of the FPGAs available on the SGI RASC architecture [4], using the SGI RASClib library [5]. In addition to this, the programmer needs to change the memory association of data in the host when transfers to/from the FPGA device
978-1-4244-4501-1/09/$25.00 ©2009 IEEE