2220 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 24, NO. 6, JUNE 2016

FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow

Yao Chen, Swathi T. Gurumani, Member, IEEE, Yun Liang, Guofeng Li, Donghui Guo, Senior Member, IEEE, Kyle Rupnow, Member, IEEE, and Deming Chen, Senior Member, IEEE

Abstract—High-level synthesis (HLS) of data-parallel input languages, such as the Compute Unified Device Architecture (CUDA), enables efficient description and implementation of independent computation cores. HLS tools can effectively translate the many threads of computation present in the parallel descriptions into independent, optimized cores. The generated hardware cores often heavily share input data and produce outputs independently. As the number of instantiated cores grows, the off-chip memory bandwidth may be insufficient to meet the demand. Hence, a scalable system architecture and a data-sharing mechanism become necessary for improving system performance. The network-on-chip (NoC) paradigm for intrachip communication has proved to be an efficient alternative to a hierarchical bus or crossbar interconnect, since it reduces wire routing congestion, operates at higher frequencies, and scales better as new nodes are added. In this paper, we present a customizable NoC architecture along with a directory-based data-sharing mechanism for an existing CUDA-to-FPGA (FCUDA) flow to enable scalability of our system and improve overall system performance. We build a fully automated FCUDA-NoC generator that takes CUDA code and custom network parameters as inputs and produces synthesizable register transfer level (RTL) code for the entire NoC system. We implement the NoC system on a Xilinx VC709 evaluation board and evaluate our architecture with a set of benchmarks.
The results demonstrate that our FCUDA-NoC design is scalable and efficient: we improve system execution time by up to 63× and reduce external memory reads by up to 81% compared with a single hardware core implementation.

Index Terms—CUDA, high-level synthesis (HLS), network-on-chip (NoC), parallel languages.

Manuscript received May 20, 2015; revised September 8, 2015 and October 26, 2015; accepted October 26, 2015. Date of publication December 8, 2015; date of current version May 20, 2016. This work was supported by the Research Grant for the Human-Centered Cyber-Physical Systems Programme within the Advanced Digital Sciences Center through the Agency for Science, Technology and Research, Singapore.

Y. Chen is with the College of Electronic Information and Optical Engineering, Nankai University, Tianjin 300071, China, and also with the Department of Electrical and Computer Engineering, University of Illinois at Urbana–Champaign, Urbana, IL 61801 USA (e-mail: yaochen@mail.nankai.edu.cn).
S. T. Gurumani is with the Advanced Digital Sciences Center, Singapore 138632 (e-mail: swathi.g@adsc.com.sg).
Y. Liang is with the School of Electrical Engineering and Computer Science, Peking University, Beijing 100871, China (e-mail: ericlyun@pku.edu.cn).
G. Li is with the College of Electronic Information and Optical Engineering, Nankai University, Tianjin 300071, China (e-mail: ligf@nankai.edu.cn).
D. Guo is with the School of Information Science and Engineering, Xiamen University, Xiamen 361006, China (e-mail: dhguo@xmu.edu.cn).
K. Rupnow is with the Advanced Digital Sciences Center, Singapore 138632 (e-mail: k.rupnow@adsc.com.sg).
D. Chen is with the Department of Electrical and Computer Engineering, University of Illinois at Urbana–Champaign, Urbana, IL 61801 USA (e-mail: dchen@illinois.edu).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2015.2497259

I. INTRODUCTION

High-level synthesis (HLS) has increasingly been adopted in hardware design to improve design time and to enable design space exploration. Development, debug, and design space exploration in high-level languages broaden the scope of exploration and reduce designer effort. A variety of input languages have been used with HLS, including Java [1], Haskell [2], [3], C/C++ [4]–[8], OpenCL [9]–[11], C# [12], SystemC [13], [14], and CUDA [15]–[18]. In general, with serial languages, such as C/C++, HLS tools use user input and automatic parallelization to generate a single, monolithic accelerator kernel. In contrast, with parallel languages, HLS tools generate small, simple accelerators for independent threads of computation, with the intention that multiple accelerators are instantiated to scale the implemented parallelism. Because CUDA is a popular parallel programming language with many existing kernels, the CUDA-to-FPGA (FCUDA) flow can map kernel computation onto FPGAs as accelerators [15]–[18]. This also provides a common programming language for heterogeneous computing platforms that contain both graphic processing units (GPUs) and FPGAs [18].

In the FCUDA flow [15]–[18], each hardware core has private on-chip memory and computation logic, and multiple cores are instantiated to improve throughput and latency. This throughput-oriented synthesis allows fine-grained scaling of the parallelism but also places stress on on-chip communication and external memory bandwidth. When many cores are instantiated, they must share access to external memory ports. Furthermore, the cores may process overlapping data; thus, sharing data on-chip can reduce off-chip bandwidth pressure. For example, with cores accelerating matrix multiplication (Fig. 1), independent blocks process overlapping input data that can be shared on-chip.
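The overlap in blocked matrix multiplication can be quantified with a small host-side sketch (written in Python rather than CUDA, with hypothetical matrix and tile sizes chosen only for illustration): every thread block in the same block row reads the same rows of A, so with a G×G grid of blocks each element of A is fetched G times from external memory unless it is shared on-chip.

```python
# Count how many thread blocks of a blocked matrix multiply C = A x B
# read each element of A. Sizes are hypothetical, for illustration only.
N = 64          # square matrix dimension
BLOCK = 16      # thread-block tile size
grid = N // BLOCK   # blocks per dimension (G = 4 here)

reads_per_A_element = {}
for by in range(grid):          # block row index
    for bx in range(grid):      # block column index
        # Block (by, bx) reads rows [by*BLOCK, (by+1)*BLOCK) of A in full.
        for r in range(by * BLOCK, (by + 1) * BLOCK):
            for c in range(N):
                key = (r, c)
                reads_per_A_element[key] = reads_per_A_element.get(key, 0) + 1

# Without on-chip sharing, every element of A is fetched once per
# block column, i.e., 'grid' times.
print(set(reads_per_A_element.values()))  # prints {4}
```

The same argument applies symmetrically to the columns of B, which motivates a data-sharing mechanism among cores rather than redundant off-chip reads.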
For a multicore accelerator design, cores must be interconnected to share access to external memory ports, as well as to enable intercore communication for data sharing. Cores may be interconnected through a shared bus, point-to-point connections, or a network-on-chip (NoC). Shared busses are area efficient but do not scale in total bandwidth as the number of cores increases. In contrast, point-to-point interconnections

1063-8210 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
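The scaling contrast among these interconnect options can be made concrete by counting physical links (an illustrative back-of-the-envelope calculation, not figures from the paper): a full point-to-point interconnect needs one dedicated link per core pair, growing as O(n²), whereas a 2-D mesh NoC adds only a bounded number of links per router, growing as O(n).

```python
def p2p_links(n):
    """Full point-to-point interconnect: one dedicated link per core pair."""
    return n * (n - 1) // 2

def mesh_links(rows, cols):
    """2-D mesh NoC: one link between each pair of adjacent routers."""
    return rows * (cols - 1) + cols * (rows - 1)

for n in (4, 16, 64):
    side = int(n ** 0.5)   # assume a square mesh of n routers
    print(f"{n} cores: point-to-point={p2p_links(n)}, mesh={mesh_links(side, side)}")
# 4 cores: point-to-point=6, mesh=4
# 16 cores: point-to-point=120, mesh=24
# 64 cores: point-to-point=2016, mesh=112
```

At 64 cores the dedicated-link count is already more than an order of magnitude larger than the mesh's, which is one way to see why point-to-point wiring becomes routing-infeasible while a mesh NoC keeps scaling.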