Throughput-Oriented Kernel Porting onto FPGAs

Alexandros Papakonstantinou, ECE Department, University of Illinois, Urbana-Champaign, IL, USA, apapako2@illinois.edu
Deming Chen, ECE Department, University of Illinois, Urbana-Champaign, IL, USA, dchen@illinois.edu
Wen-Mei Hwu, ECE Department, University of Illinois, Urbana-Champaign, IL, USA, w-hwu@illinois.edu
Jason Cong, CS Department, University of California, Los Angeles, CA, USA, cong@cs.ucla.edu
Yun Liang, EECS School, Peking University, Beijing, China, ericlyun@pku.edu.cn

ABSTRACT

Reconfigurable devices are often employed in heterogeneous systems due to their low power and parallel processing advantages. An important usability requirement is the support of a homogeneous programming interface. Nevertheless, homogeneous programming interfaces do not eliminate the need for code tweaking to enable efficient mapping of the computation across heterogeneous architectures. In this work we propose a code optimization framework which analyzes and restructures CUDA kernels that are optimized for GPU devices in order to facilitate synthesis of high-throughput custom accelerators on FPGAs. The proposed framework enables efficient performance porting without manual code tweaking or annotation by the user. A hierarchical region graph, in tandem with code motions and graph coloring of array variables, is employed to restructure the kernel for high-throughput execution on FPGAs.

1. INTRODUCTION

Tighter integration of latency-oriented CPUs with throughput-oriented compute architectures with massive parallelism and low power characteristics is becoming common in many compute domains (e.g., mobile, high-performance computing, compute clusters) [1, 17, 16]. Programming efficiency is a prerequisite for leveraging the benefits of heterogeneous systems.
The introduction of parallel programming models and semantics such as CUDA [15], OpenCL [2] and OpenACC [3] addresses the need for programming heterogeneous processors through a homogeneous programming interface. Homogeneous programming models facilitate functionality porting but often necessitate device-specific code tweaking to achieve performance porting.

In this work we propose a throughput-oriented performance porting (TOPP) framework that leverages code restructuring techniques to enable automatic performance porting of CUDA kernels onto FPGAs. CUDA offers explicit control over (i) data memory spaces, (ii) computation distribution across cores, and (iii) thread synchronization. Hence, CUDA kernels designed for the GPU architecture may not map efficiently on reconfigurable devices. The TOPP framework proposed in this work leverages the hierarchical region graph (HRG) representation to efficiently analyze and restructure the kernel code. Restructuring entails a wide range of transformations including code motions, synchronization elimination (through array renaming), data communication elimination (through rematerialization), and idle thread elimination (through control flow fusion and loop interchange). As data handling plays a critical role in the performance of massively parallel CUDA kernels, the proposed flow employs advanced dataflow and symbolic analysis techniques to efficiently manage data.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC '13, May 29 - June 07 2013, Austin, TX, USA. Copyright 2013 ACM 978-1-4503-2071-9/13/05 ...$15.00.
Graph coloring, in tandem with throughput estimation techniques, is used to optimize kernel data structure allocation and the utilization of on-chip memories. Through orchestration of these code transformation and optimization techniques, the TOPP framework generates C code which is fed to high-level synthesis (HLS) to generate high-throughput custom accelerators on the reconfigurable architecture. Our experimental study shows that the proposed flow improves FPGA execution performance by more than 4X without manual code tweaking by the user. The main contributions of this work are summarized below:

• Introduction of the hierarchical region graph representation of CUDA kernels.
• Implementation of an automated performance porting flow from CUDA to FPGAs.
• Description of efficient throughput metrics for throughput-oriented kernel restructuring.
• Experimental evaluation of the performance porting capability of the TOPP framework.

In the next section we provide further background information on CUDA-to-FPGA flows and introduce the HRG representation. Section 3 offers an overview of the TOPP framework, which is complemented by algorithms and other implementation details in the appendices. Finally, Section 4 contains the experimental evaluation of TOPP, followed by conclusions in Section 5.

2. MOTIVATION AND BACKGROUND

CUDA employs a SIMT (single instruction, multiple threads) parallel programming interface which efficiently expresses multiple fine-grained threads executing as groups of cooperative thread arrays (CTAs). The GPU architecture comprises high-throughput compute cores grouped into streaming multiprocessors (SMs). Computation is distributed across SMs at CTA granularity [15]. A carefully crafted interconnect scheme between SMs and off-chip memory facilitates high-bandwidth data accesses at low latency overhead.