Throughput-Oriented Kernel Porting onto FPGAs

Alexandros Papakonstantinou, ECE Department, University of Illinois, Urbana-Champaign, IL, USA, apapako2@illinois.edu
Deming Chen, ECE Department, University of Illinois, Urbana-Champaign, IL, USA, dchen@illinois.edu
Wen-Mei Hwu, ECE Department, University of Illinois, Urbana-Champaign, IL, USA, w-hwu@illinois.edu
Jason Cong, CS Department, University of California, Los Angeles, CA, USA, cong@cs.ucla.edu
Yun Liang, EECS School, Peking University, Beijing, China, ericlyun@pku.edu.cn

ABSTRACT

Reconfigurable devices are often employed in heterogeneous systems due to their low power and parallel processing advantages. An important usability requirement is the support of a homogeneous programming interface. Nevertheless, homogeneous programming interfaces do not eliminate the need for code tweaking to enable efficient mapping of the computation across heterogeneous architectures. In this work we propose a code optimization framework which analyzes and restructures CUDA kernels that are optimized for GPU devices in order to facilitate synthesis of high-throughput custom accelerators on FPGAs. The proposed framework enables efficient performance porting without manual code tweaking or annotation by the user. A hierarchical region graph, in tandem with code motions and graph coloring of array variables, is employed to restructure the kernel for high-throughput execution on FPGAs.

1. INTRODUCTION

Tighter integration of latency-oriented CPUs with throughput-oriented compute architectures with massive parallelism and low power characteristics is becoming common in many compute domains (e.g., mobile, high-performance computing, compute clusters) [1, 17, 16]. Programming efficiency is a prerequisite for leveraging the benefits of heterogeneous systems.
The introduction of parallel programming models and semantics such as CUDA [15], OpenCL [2] and OpenACC [3] addresses the need for programming heterogeneous processors through a homogeneous programming interface. Homogeneous programming models facilitate functionality porting but often necessitate device-specific code tweaking to achieve performance porting.

In this work we propose a throughput-oriented performance porting (TOPP) framework that leverages code restructuring techniques to enable automatic performance porting of CUDA kernels onto FPGAs. CUDA offers explicit control over (i) data memory spaces, (ii) computation distribution across cores, and (iii) thread synchronization. Hence, CUDA kernels designed for the GPU architecture may not map efficiently on reconfigurable devices. The TOPP framework proposed in this work leverages the hierarchical region graph (HRG) representation to efficiently analyze and restructure the kernel code. Restructuring entails a wide range of transformations including code motions, synchronization elimination (through array renaming), data communication elimination (through rematerialization), and idle thread elimination (through control flow fusion and loop interchange). As data handling plays a critical role in the performance of massively parallel CUDA kernels, the proposed flow employs advanced dataflow and symbolic analysis techniques to efficiently manage data.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC '13, May 29 - June 07 2013, Austin, TX, USA. Copyright 2013 ACM 978-1-4503-2071-9/13/05 ...$15.00.
Graph coloring, in tandem with throughput estimation techniques, is used to optimize kernel data structure allocation and the utilization of on-chip memories. Through orchestration of these code transformation and optimization techniques, the TOPP framework generates C code which is fed to high-level synthesis (HLS) to generate high-throughput custom accelerators on the reconfigurable architecture. Our experimental study shows that the proposed flow improves FPGA execution performance by more than 4X without manual code tweaking by the user. The main contributions of this work are summarized below:

• Introduction of the hierarchical region graph representation of CUDA kernels.
• Implementation of an automated performance porting flow from CUDA to FPGAs.
• Description of efficient throughput metrics for throughput-oriented kernel restructuring.
• Experimental evaluation of the performance porting capability of the TOPP framework.

In the next section we provide further background information on CUDA-to-FPGA flows and introduce the HRG representation. Section 3 offers an overview of the TOPP framework, which is complemented by algorithms and other implementation details in the appendices. Finally, Section 4 contains the experimental evaluation of TOPP, followed by conclusions in Section 5.

2. MOTIVATION AND BACKGROUND

CUDA employs a SIMT (single instruction, multiple threads) parallel programming interface which efficiently expresses multiple fine-grained threads executing as groups of cooperative thread arrays (CTAs). The GPU architecture comprises high-throughput compute cores grouped into streaming multiprocessors (SMs). Computation is distributed across SMs at CTA granularity [15]. A carefully crafted interconnect scheme between SMs and off-chip memory facilitates high-bandwidth data accesses at low latency overhead.