2220 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 24, NO. 6, JUNE 2016
FCUDA-NoC: A Scalable and Efficient
Network-on-Chip Implementation
for the CUDA-to-FPGA Flow
Yao Chen, Swathi T. Gurumani, Member, IEEE, Yun Liang, Guofeng Li, Donghui Guo, Senior Member, IEEE,
Kyle Rupnow, Member, IEEE, and Deming Chen, Senior Member, IEEE
Abstract—High-level synthesis (HLS) of data-parallel input
languages, such as the Compute Unified Device Architecture (CUDA), enables efficient description and implementation
of independent computation cores. HLS tools can effectively
translate the many threads of computation present in the parallel
descriptions into independent, optimized cores. The generated
hardware cores often heavily share input data and produce
outputs independently. As the number of instantiated cores
grows, the off-chip memory bandwidth may be insufficient to
meet the demand. Hence, a scalable system architecture and
a data-sharing mechanism become necessary for improving
system performance. The network-on-chip (NoC) paradigm for
intrachip communication has proved to be an efficient alternative to a hierarchical bus or crossbar interconnect, since it
can reduce wire routing congestion, and has higher operating
frequencies and better scalability for adding new nodes. In this
paper, we present a customizable NoC architecture along with
a directory-based data-sharing mechanism for an existing
CUDA-to-FPGA (FCUDA) flow to enable scalability of our system
and improve overall system performance. We build a fully
automated FCUDA-NoC generator that takes in CUDA code and
custom network parameters as inputs and produces synthesizable
register transfer level (RTL) code for the entire NoC system. We
implement the NoC system on a VC709 Xilinx evaluation board
and evaluate our architecture with a set of benchmarks. The
results demonstrate that our FCUDA-NoC design is scalable and efficient: it improves system execution time by up to 63× and reduces external memory reads by up to 81% compared with a single-hardware-core implementation.
Manuscript received May 20, 2015; revised September 8, 2015 and
October 26, 2015; accepted October 26, 2015. Date of publication
December 8, 2015; date of current version May 20, 2016. This work was
supported by the Research Grant for the Human-Centered Cyber-Physical
Systems Programme within the Advanced Digital Sciences Center through
the Agency for Science, Technology and Research, Singapore.
Y. Chen is with the College of Electronic Information and Optical
Engineering, Nankai University, Tianjin 300071, China, and also with the
Department of Electrical and Computer Engineering, University of Illinois at
Urbana–Champaign, Urbana, IL 61801 USA (e-mail: yaochen@mail.nankai.edu.cn).
S. T. Gurumani is with the Advanced Digital Sciences Center, Singapore
138632 (e-mail: swathi.g@adsc.com.sg).
Y. Liang is with the School of Electrical Engineering and Computer Science,
Peking University, Beijing 100871, China (e-mail: ericlyun@pku.edu.cn).
G. Li is with the College of Electronic Information and Optical Engineering,
Nankai University, Tianjin 300071, China (e-mail: ligf@nankai.edu.cn).
D. Guo is with the School of Information Science and Engineering, Xiamen
University, Xiamen 361006, China (e-mail: dhguo@xmu.edu.cn).
K. Rupnow is with the Advanced Digital Sciences Center, Singapore 138632 (e-mail: k.rupnow@adsc.com.sg).
D. Chen is with the Department of Electrical and Computer Engineering,
University of Illinois at Urbana–Champaign, Urbana, IL 61801 USA (e-mail:
dchen@illinois.edu).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2015.2497259
Index Terms—CUDA, high-level synthesis (HLS), network-on-chip (NoC), parallel languages.
I. INTRODUCTION
HIGH-LEVEL synthesis (HLS) has increasingly been
adopted in hardware design to improve design time and
to perform design space exploration. Development, debug,
and design space exploration in high-level languages allow
improved breadth of exploration and reduced designer effort.
A variety of input languages have been used with
HLS, including Java [1], Haskell [2], [3], C/C++ [4]–[8],
OpenCL [9]–[11], C# [12], SystemC [13], [14], and
CUDA [15]–[18]. For serial languages, such as C/C++, HLS tools rely on user input and automatic parallelization to generate a single, monolithic accelerator kernel.
In contrast, for parallel languages, HLS tools generate small, simple accelerators for independent threads of computation
with the intention that multiple accelerators are instantiated
to scale implemented parallelism. Because CUDA is a popular parallel programming language, many kernels already exist in CUDA, and the CUDA-to-FPGA (FCUDA) flow can map this kernel computation to FPGAs as accelerators [15]–[18]. This also
provides a common programming language that can program
heterogeneous computing platforms that contain both graphic
processing units (GPUs) and FPGAs [18].
In the FCUDA flow [15]–[18], each hardware core has
private on-chip memory and computation logic, and multiple
cores are instantiated to improve throughput and latency.
This throughput-oriented synthesis allows fine-grained scaling of the parallelism but also places stress on on-chip communication and external memory bandwidth. When many cores are instantiated, they must share access to external memory ports. Furthermore, the cores may process overlapping data; thus, the opportunity to share data on-chip can reduce off-chip bandwidth pressure. For example, with cores accelerating matrix multiplication (Fig. 1), independent blocks process overlapping input data that can be shared on-chip.
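To make this overlap concrete, the following sketch (our illustration, not code from the paper; the function name is hypothetical) counts input-tile reads for a blocked matrix multiplication. Without on-chip sharing, every output block independently re-reads its block-row of A and block-column of B, even though the set of unique tiles actually needed is far smaller.

```python
# Illustrative sketch: quantify how much input data overlaps across the
# independent blocks of an n_blocks x n_blocks blocked matrix multiplication.
# Each output block consumes one block-row of A and one block-column of B
# (n_blocks tiles from each matrix).

def tile_read_stats(n_blocks):
    """Return (tile reads with no sharing, unique tiles needed)."""
    # Without sharing: every one of the n_blocks^2 output blocks reads
    # 2 * n_blocks input tiles on its own.
    reads_without_sharing = n_blocks * n_blocks * 2 * n_blocks
    # With ideal on-chip sharing: each tile of A and B is fetched from
    # external memory only once.
    unique_tiles = 2 * n_blocks * n_blocks
    return reads_without_sharing, unique_tiles

reads, unique = tile_read_stats(8)
print(reads, unique, 1 - unique / reads)  # for 8x8 blocks: 1024 reads, 128 unique tiles
```

For an 8×8 grid of blocks, ideal sharing would eliminate 87.5% of external tile reads, which is the kind of redundancy the directory-based data-sharing mechanism in this paper targets.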
For a multicore accelerator design, cores must be interconnected to share access to external memory ports, as well as to enable intercore communication for data sharing. Cores
may be interconnected through a shared bus, point-to-point
connections, or a network-on-chip (NoC). Shared busses are
area efficient but do not scale in total bandwidth as the number
of cores increases. In contrast, point-to-point interconnections
1063-8210 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.