OpenMP Extensions for FPGA Accelerators

Daniel Cabrera 1,2, Xavier Martorell 1,2, Georgi Gaydadjiev 3, Eduard Ayguade 1,2, Daniel Jiménez-González 1,2

1 Barcelona Supercomputing Center, c/ Jordi Girona 31, Torre Girona, E-08034 Barcelona, Spain
2 Universitat Politecnica de Catalunya, c/ Jordi Girona 1-3, Campus Nord-UPC, Modul C6, E-08034 Barcelona, Spain
{dcabrera, xavim, eduard, djimenez}@ac.upc.edu
3 Delft University of Technology, Mekelweg 4, 2628 CD Delft, The Netherlands
g.n.gaydadjiev@its.tudelft.nl

Abstract—Reconfigurable computing is one of the paths to explore towards low-power supercomputing. However, programming these reconfigurable devices is not an easy task and still requires significant research and development effort to make it truly productive. In addition, the use of these devices as accelerators in multicore, SMP and ccNUMA architectures adds an extra level of programming complexity: the offloading of tasks to reconfigurable devices must be specified, as must the interoperability with current shared-memory programming paradigms such as OpenMP. This paper presents extensions to OpenMP 3.0 that address this second challenge, together with an implementation in a prototype runtime system. With these extensions the programmer can easily express the offloading of an already existing reconfigurable binary code (bitstream), hiding all the complexities related to device configuration, bitstream loading, and data arrangement and movement to the device memory. Our current prototype implementation targets SGI Altix systems with RASC blades (based on the Virtex-4 FPGA). We analyze the overheads introduced by this implementation and propose a hybrid host/device operational mode that hides some of these overheads, significantly improving the performance of the applications. A complete evaluation of the system is done with a matrix multiplication kernel, including an estimation that considers different FPGA frequencies.

I. INTRODUCTION

The gigahertz race we were used to in the last decade has stopped due to power dissipation problems. The extra transistors available for new designs are no longer used to increase the complexity of superscalar, out-of-order or multithreaded architectures. Instead, the technological increase in transistor count is used to include more than one core on the same chip (homogeneous multicore) and/or to incorporate accelerators (heterogeneous multicore) well suited for certain application domains, such as the GPU units in [1] or the vector units in the Cell/B.E. [2]. For these accelerators, exploiting the available parallelism is not an easy task and relies on the use of specific SDKs.

The use of specialized devices designed to compute a specific function (ASIC circuits) is another alternative that benefits a specific kind of application. For example, an ASIC that computes fast Fourier transforms can clearly eliminate the computation bottlenecks found in some bioinformatics applications. Field Programmable Gate Arrays (FPGAs) are accelerators whose specific functionality can be retargeted to different domains at runtime. However, efficiently programming these specific functionalities requires the use of low-level hardware description languages (HDLs), such as Verilog or VHDL, which general-purpose programmers are not used to.

The productive parallelization of applications for heterogeneous multicore architectures that include one or more of such accelerators requires programming models able to express the proper offloading of tasks and the data needed to perform the computation. This is the purpose of this paper and, in particular, of our proposal that extends OpenMP 3.0 tasking [3] to target heterogeneous architectures with FPGA-based accelerators. The OpenMP 3.0 task pragma fits well with the idea of using one or more FPGAs as accelerators.
In this paper we assume that the bitstreams corresponding to the computations to be offloaded in tasks are either existing IP cores or are generated using other compilation tools. This may impose some restrictions on the behavior of the tasks to be offloaded, for example on the use of synchronization constructs.

In order to motivate our extensions to OpenMP 3.0 and their implementation in the runtime system, Figure 1 shows part of the code that is necessary to offload the execution of a matrix multiplication bitstream (matmul_fpga) to one of the FPGAs available on the SGI RASC architecture [4], using the SGI RASClib library [5]. In addition to this, the programmer needs to change the memory association of data in the host when transfers to/from the FPGA device

978-1-4244-4501-1/09/$25.00 ©2009 IEEE