TRANSPOSE-FREE SAR IMAGING ON FPGA PLATFORM

Chi-Li Yu and Chaitali Chakrabarti
School of Electrical, Computer and Energy Engineering
Arizona State University, Tempe, USA
Email: {chi-li.yu, chaitali}@asu.edu

ABSTRACT

Range-Doppler Algorithm (RDA) and Chirp Scaling Algorithm (CSA) are two widely used Synthetic Aperture Radar (SAR) imaging schemes. Both require multiple transpose operations, which increase the total processing time significantly. In this paper, we propose transpose-free flows for both RDA and CSA. This is achieved by modifying the existing flows to use the access patterns favored by the external memory. As a result, the peak performance of the memory is sustained and the processing time is shortened. The proposed Field Programmable Gate Array (FPGA)-based implementation outperforms existing SAR accelerators; it computes RDA and CSA on a data size of 4,096 × 4,096 in 323 ms and 162 ms, respectively.

Index Terms: SAR, FPGA, DRAM, FFT.

1. INTRODUCTION

Synthetic Aperture Radar (SAR) has been widely used in military surveillance, environmental monitoring, and earth resource surveys. From an airborne or a space-borne platform, a SAR system generates high-resolution images covering large areas in all weather conditions, day or night. The raw data collected by a SAR system is highly unfocused due to electromagnetic wave scattering and the relative motion between the radar and the earth's surface.

Several algorithms have been developed for digital SAR imaging, including the range-Doppler algorithm (RDA) [1] and the chirp-scaling algorithm (CSA) [2]. The key kernels of these algorithms are the Discrete Fourier Transform (DFT), interpolation, and convolution. Since the algorithms can be highly parallelized, they can be computed efficiently on a variety of parallel computing platforms, including Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs).
Compared to a GPU, an FPGA consumes much less power, which makes it more suitable for SAR image processing on airplanes or satellites.

For real-time SAR imaging, the bottleneck is data transfer between the chip and external memory. SAR images are typically very large, e.g. 4,096 × 4,096, and must be stored in an external memory, which is usually a Synchronous Dynamic RAM (SDRAM). The SDRAM's transfer rate is slow compared to processor clock speeds, so the performance of SAR imaging systems is determined by the memory bandwidth.

This work is supported in part by the Defense Advanced Research Projects Agency (DARPA) under Grant W911NF-05-1-0248.

Furthermore, SAR imaging algorithms need to perform computations along the row and column directions of a 2D image several times. Because SDRAM favors only row-wise burst access, most SAR processors [3, 4, 5] need to transpose the 2D data before column-wise operations. This is done by transferring column-wise data from the SDRAM to the chip, realigning the column data into adjacent addresses, and storing it back to the memory. Another transpose operation is then required before the next row-wise operation. Since all SAR imaging algorithms require multiple transpose operations, the timing performance of SAR imaging degrades significantly.

To eliminate the transpose operation, the method in [6] stores SAR data in a multi-chip SDRAM array and takes advantage of the multi-bank memory organization to remove the overhead of accessing column-wise data. However, the design in [6] does not support general SDRAM modules, and significant customization is required. Another way to eliminate transpose operations is to increase the locality of data along the column direction in an SDRAM module. The method in [7] re-maps the column-wise data into a physical page of SDRAM to increase the access efficiency. These methods can achieve 80% of the SDRAM's peak rate for both row-wise and column-wise accesses.
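To make the cost of column-wise access concrete, the following minimal sketch counts how many distinct SDRAM pages are opened when one image row and one image column of a row-major 4,096 × 4,096 array are read. The 8-byte sample size, the 8 KiB page size, and the linear address-to-page mapping without bank interleaving are illustrative assumptions, not parameters taken from the paper, and the helper names are hypothetical.

```python
def page_switches(addresses, page_bytes):
    """Count how often consecutive accesses fall into a different DRAM page.

    The first access counts as one page open.
    """
    switches = 0
    prev_page = None
    for addr in addresses:
        page = addr // page_bytes  # assumed linear mapping, no bank interleaving
        if page != prev_page:
            switches += 1
            prev_page = page
    return switches


def row_addresses(n, elem_bytes, row):
    """Byte addresses of one image row in a row-major n x n array."""
    return [(row * n + c) * elem_bytes for c in range(n)]


def column_addresses(n, elem_bytes, col):
    """Byte addresses of one image column in a row-major n x n array."""
    return [(r * n + col) * elem_bytes for r in range(n)]


N = 4096           # image dimension used in the paper
ELEM_BYTES = 8     # assumption: one complex sample occupies 8 bytes
PAGE_BYTES = 8192  # assumption: 8 KiB DRAM page

row_opens = page_switches(row_addresses(N, ELEM_BYTES, 0), PAGE_BYTES)
col_opens = page_switches(column_addresses(N, ELEM_BYTES, 0), PAGE_BYTES)
print(row_opens, col_opens)  # prints: 4 4096
```

Under these assumptions, reading one image row opens 4 pages, while reading one image column opens a new page on every access. This is the locality gap that transposition, multi-bank storage, or data re-mapping is meant to close.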
However, the data re-mapping in [7] must be performed before the SAR imaging process and takes extra time.

In this paper, we propose transpose-free SAR imaging flows for RDA and CSA. The flows are mapped to a unified architecture, which is implemented on an FPGA-based platform. The implementation has superior timing performance, since it avoids transpose operations and uses the memory bandwidth efficiently. Simulation results based on the Xilinx ML605 platform show that the RDA and CSA computations with data size 4,096 × 4,096 can be completed in 323 ms and 162 ms, respectively. This implementation outperforms existing SAR image accelerators, including FPGA- and GPU-based solutions [7, 8, 9].

The rest of the paper is organized as follows. A brief description of RDA and CSA is given in Section 2. In Section 3, the transpose-free imaging flows for RDA and CSA are