Application Acceleration on FPGAs with OmpSs@FPGA

Jaume Bosch*†, Xubin Tan*†, Antonio Filgueras*, Miquel Vidal*, Marc Mateu*†, Daniel Jiménez-González*†, Carlos Álvarez*†, Xavier Martorell*†, Eduard Ayguadé*†, and Jesus Labarta*†

* Computer Science Dept., Barcelona Supercomputing Center, Barcelona, Spain. Email: name.surname@bsc.es
† Computer Architecture Dept., Universitat Politècnica de Catalunya, Barcelona, Spain. Email: djimenez,calvarez,xavim@ac.upc.edu

Abstract—OmpSs@FPGA is the flavor of OmpSs that allows offloading application functionality to FPGAs. Similarly to OpenMP, it is based on compiler directives. While the OpenMP specification also includes support for heterogeneous execution, we use OmpSs and OmpSs@FPGA as a prototype implementation to develop new ideas for OpenMP. OmpSs@FPGA implements the tasking model with runtime support to automatically exploit all the SMP and FPGA resources available in the execution platform. In this paper, we present the OmpSs@FPGA ecosystem, based on the Mercurium compiler and the Nanos++ runtime system. We show how applications are transformed to run on the SMP cores and the FPGA. The application kernels, defined as tasks to be accelerated using OmpSs directives, are: 1) transformed by the compiler into kernels connected to the proper synchronization and communication ports, 2) extracted to intermediate files, 3) compiled through the FPGA vendor HLS tool, and 4) used to configure the FPGA. Our Nanos++ runtime system schedules the application tasks on the platform, and is able to use the SMP cores and the FPGA accelerators at the same time. We present the evaluation of the OmpSs@FPGA environment with the Matrix Multiplication, Cholesky and N-Body benchmarks, showing the internal details of the execution and the performance obtained on a Zynq Ultrascale+ MPSoC (speedups of up to 128x).
The source code uses OmpSs@FPGA annotations, and different Vivado HLS optimization directives are applied for acceleration.

Keywords-Heterogeneous Parallelism; OmpSs; FPGAs

I. INTRODUCTION

Current trends in computer architecture focus on providing heterogeneous execution environments. Heterogeneity comes in many different flavors. One important flavor is an environment that incorporates accelerators within an FPGA (Field-Programmable Gate Array), providing specialized hardware to better execute specific algorithms. FPGA devices are programmed by means of bitstreams, usually generated by vendor-proprietary tools from a specification written in the VHDL or Verilog hardware description languages. In addition, an important characteristic has to be taken into account: the vendor compilation tools that perform place and route to configure the FPGA usually take from minutes to hours. As a result, porting new code to these platforms is usually a slow process.

Vendors also provide FPGAs integrated with a few cores that can be used as host cores. In this case, the FPGA shares the physical memory with the cores. FPGA modules (accelerators from now on) may have an additional but limited amount of local memory. This accelerator local memory may be needed to build high-performance accelerators and, in that case, data movements are a must for them to work. On the other hand, the limited amount of memory forces the use of blocking techniques when the workload does not fit in the FPGA resources. Therefore, memory transfers between host memory and accelerator local memory should either be optimized to reduce the communication overhead, or be overlapped, with or without blocking execution, with the accelerator computation in order to hide it. Related to those memory transfers, the FPGA design incorporates the implementation of the bus protocol, as part of its programming, to perform FPGA external accesses.
Thus, the programmer needs to be aware of this protocol and should incorporate it in the bitstream generation process. To reduce the programming effort, programming models have to provide easy means to express data transfers between host and accelerators, reducing the impact of those communications.

In our work, the OmpSs@FPGA ecosystem addresses the previous challenges, achieving high productivity by providing higher-level abstractions that help the programmer generate high-performance code. For example:

• Making memory allocation and data copies automatic, based on directives.
• Providing the programmer with facilities to perform blocking from inside the accelerators.
• Automating the generation of the CPU and FPGA binaries from the C/C++ implementation, by transparently running open or vendor tools.
• Allowing the use of parallelism based on tasking (instead of kernel invocations).
• Providing support for data-dependent tasks, and managing the execution based on such data dependences.
• Providing FPGA execution trace generation support.

As a result, the programming environment (hopefully) completely hides the target architecture, providing a clean, high-level, abstract interface to programmers, while all the management and scheduling intelligence is placed in the runtime system.