Architectural Synthesis of Computational Pipelines with Decoupled Memory Access

Shaoyi Cheng and John Wawrzynek
Department of EECS, UC Berkeley, California, USA 94720
Email: sh_cheng@berkeley.edu, johnw@eecs.berkeley.edu

Abstract—As high-level synthesis (HLS) moves towards mainstream adoption among FPGA designers, it has proven to be an effective method for rapid hardware generation. However, in the context of offloading compute-intensive software kernels to FPGA accelerators, current HLS tools do not always take full advantage of the hardware platforms. In this paper, we present an automatic flow to refactor and restructure processor-centric software implementations, making them better suited for FPGA platforms. The methodology generates pipelines that decouple memory operations and data access from computation. The resulting pipelines have much better throughput due to their efficient use of memory bandwidth and improved tolerance to data access latency. The methodology complements existing work in high-level synthesis, easing the creation of heterogeneous systems with high-performance accelerators and general-purpose processors. With this approach, for a set of non-regular algorithm kernels written in C, a performance improvement of 3.3x to 9.1x is observed over direct C-to-hardware mapping using a state-of-the-art HLS tool.

Keywords—FPGA, Hardware Acceleration, High-level Synthesis, Memory-level Parallelism, Pipeline Parallelism, Memory Subsystem Optimization

I. INTRODUCTION

As the complexity of FPGA designs increases, there has been a trend towards design synthesis from higher levels of specification. Being more compact and expressive, high-level languages, when used as design input, can greatly increase the productivity of engineers. To tackle the challenge of generating hardware functional blocks from high-level behavioral descriptions, many commercial [1], [2] and open source [3], [4] tools have been developed over the years.
Programming languages such as C/C++, designed for processor-centric execution, are used by these high-level synthesis programs as the medium for input specification. Meanwhile, recent developments in FPGA SoCs, where the reconfigurable arrays are integrated with hard processors and memory interface IPs, have created highly versatile computing platforms [5]. This combination of new tools and devices has created new opportunities for applications written in high-level languages. Applications can be mapped to these heterogeneous substrates, with the compute-intensive loop nests running in accelerators and the remainder of the code executing on processors. However, the performance of the mapped implementations often falls short of the platform's potential when HLS tools are employed to map the software code directly to the reconfigurable fabric.

Fundamentally, the barrier between software and the FPGA fabric is more than just the programming language used; the real difference lies in the paradigms of computation. To produce good FPGA designs with HLS, users still need to visualize and create hardware descriptions, albeit with C/C++ syntax. To effectively harness the power of reconfigurable platforms for software acceleration, in addition to inserting pragmas and directives, designers often need to restructure the original code to separate out memory accesses before invoking HLS. Also, to boost FPGA accelerator efficiency, it is often desirable to convert conventional memory accesses to a streaming model and to insert DMA engines [6]. Further enhancements can be achieved by including accelerator-specific caching and burst accesses.

In this paper, we try to narrow the gap between software and hardware execution mechanisms by automatically transforming application kernels into pipelines of processing stages, complemented by load/store primitives capable of pipelined data accesses.
Our flow slices the original control dataflow graph (CDFG) of the performance-critical loop nests into subgraphs, connected with acyclic communication (section III). Special transformations are then performed on memory operations to allow pipelining of outstanding requests in the memory subsystem (section IV-A). Furthermore, the hardware structures connecting the accelerator and the memory are synthesized based on the observed data access patterns of the program (section IV-B). Finally, each of the subgraphs is fed to a conventional high-level synthesis flow, generating independent datapaths and controllers. FIFO channels are instantiated to connect the datapaths, forming the final system (section V). When compared to hardware synthesized directly from the original program using HLS, the accelerators produced by our flow have superior performance (section VI). Their tolerance to data access latency is also demonstrated with a variety of memory subsystem configurations.

The main contributions of this paper are:

• a novel tool flow for converting software loop nests to pipelines of decoupled processing stages, in which:
  ◦ the effects of long-latency operations are localized,
  ◦ memory load/store operations are converted to data access modules that use memory bandwidth efficiently, and
  ◦ memory access mechanisms are customized based on the data access patterns of the accelerated loop nests;
• an experimental evaluation of our approach against direct mappings using a state-of-the-art HLS tool, on FPGA SoCs with hard processors and memory interface IPs.