Coarse-Grain Pipelining on Multiple FPGA Architectures*

Heidi Ziegler**, Byoungro So, Mary Hall, Pedro C. Diniz
University of Southern California / Information Sciences Institute
4676 Admiralty Way, Suite 1001
Marina del Rey, California 90292
{ziegler, bso, mhall, pedro}@isi.edu

Abstract

Reconfigurable systems, and in particular, FPGA-based custom computing machines, offer a unique opportunity to define application-specific architectures. These architectures offer performance advantages for application domains such as image processing, where the use of customized pipelines exploits the inherent coarse-grain parallelism. In this paper we describe a set of program analyses and an implementation that map a sequential and un-annotated C program into a pipelined implementation running on a set of FPGAs, each with multiple external memories. Based on well-known parallel computing analysis techniques, our algorithms perform unrolling for operator parallelization, reuse and data layout for memory parallelization, and precise communication analysis. We extend these techniques for FPGA-based systems to automatically partition the application data and computation into custom pipeline stages, taking into account the available FPGA and interconnect resources. We illustrate the analysis components by way of an example, a machine vision program. We present the algorithm results, derived with minimal manual intervention, which demonstrate the potential of this approach for automatically deriving pipelined designs from high-level sequential specifications.

Keywords: Coarse-grain Pipelining; FPGA-based Custom Computing Machines; Parallelizing Compiler Analysis Techniques.

1. Introduction

The implementation of pipelined execution techniques is an effective method to improve the throughput of a given computing architecture.
By dividing the set of operations to be executed into subsets, or pipe stages, pipelining increases the number of operations that may execute simultaneously, thus exploiting the implicit parallelism in the sequential application and making effective use of the available resources. The overall performance improvement comes from higher throughput, even though the execution time of each individual operation remains unaltered.

In a traditional, or synchronous, pipeline, the pipe stages are defined so that they contain equal amounts of computation, in order to avoid the idle processing time that arises in an unbalanced system. When pipe stages are not easily balanced, because widely varying types of operations make up the computation, an asynchronous pipeline is employed. In this case, the flow of data between neighboring stages is controlled by a handshaking protocol that signals data availability.

The asynchronous pipeline provides an ideal processing model on which digital image processing applications [22] execute efficiently. Typically, these applications process multiple images using simple image operators. Examples of common image processing operators include a wide range of stencil operators (e.g., over a fixed N x N window) as well as simple thresholding and offsetting computations. The typical application has multiple loop nests inside a main loop that iterates over all the data, which is commonly stored in multi-dimensional arrays.

FPGA-based computing machines offer a unique opportunity to design custom pipelining structures matched to each application. One or several loop bodies can be synthesized on each of the FPGAs. In addition, the layout of the stages and their connectivity can be designed to match the application requirements in terms of the relative consumer/producer rates each stage exhibits.
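As a concrete illustration, the following hypothetical C fragment (function names and image size are ours, not taken from the paper) sketches the kind of sequential stencil-plus-threshold code this application domain produces. Each loop nest is a natural candidate for its own pipe stage, with the second stage consuming the output of the first:

```c
/* Illustrative sketch of the kind of sequential C code targeted for
 * coarse-grain pipelining: a 3x3 mean stencil followed by a threshold.
 * Names (mean3x3, threshold) and the size N are hypothetical. */
#include <assert.h>

#define N 8  /* image dimension, chosen small for illustration */

/* Candidate pipe stage 1: 3x3 mean filter over the image interior. */
void mean3x3(unsigned char in[N][N], unsigned char out[N][N]) {
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++) {
            int sum = 0;
            for (int di = -1; di <= 1; di++)
                for (int dj = -1; dj <= 1; dj++)
                    sum += in[i + di][j + dj];
            out[i][j] = (unsigned char)(sum / 9);
        }
}

/* Candidate pipe stage 2: binary thresholding of the filtered image. */
void threshold(unsigned char img[N][N], unsigned char t) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            img[i][j] = (img[i][j] > t) ? 255 : 0;
}
```

In a pipelined FPGA implementation of this kind of code, the two loop nests would be mapped to separate stages, and the array `out` would become the communication channel between them rather than a shared memory buffer.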
Internal FPGA register resources and direct wires can be used to establish high-performance inter-stage communication, avoiding excessive buffer read/write and synchronization operations and thereby increasing overall throughput.

The complexity and sophistication of the data orchestration and control required by pipelined execution make automatic tools that can analyze sequential applications and derive pipelined implementations extremely desirable. Fortunately, the domains of digital image processing and graphics are a perfect match for existing parallelizing compiler analysis techniques. Using these techniques, a compiler and synthesis system can analyze the input sequential code and partition its data and computation among multiple FPGAs for pipelined execution. The compiler analyzes the set of pipeline stages and schedules them onto the target architecture, respecting the original program's data dependences and the target architecture's FPGA and memory capacity constraints. In so doing, the compiler analysis can derive the communication requirements between pipeline stages.

In this paper we describe a set of compiler analyses that address the issues in automatically mapping computations, expressed in high-level sequential languages such as C, directly to FPGA-based computing architectures. In particular, the paper makes the following specific contributions:

- It describes an implementation of several parallelizing compiler analysis techniques and transformations required to automatically design platform- and application-specific pipelines, which have been extended to map computations onto FPGA-based architectures.

_________________________
* Funded by the Defense Advanced Research Projects Agency under contract number F30603-98-2-0113.
** Funded by a Boeing Satellite Systems Doctoral Scholars Fellowship.