288 IEEE TRANSACTIONS ON ADVANCED PACKAGING, VOL. 28, NO. 2, MAY 2005

Optimal Chip-Package Codesign for High-Performance DSP

Pronita Mehrotra, Member, IEEE, Vikram Rao, Thomas M. Conte, Senior Member, IEEE, and Paul D. Franzon, Senior Member, IEEE

Abstract—In high-performance DSP systems, memory bandwidth can be improved using high-density interconnect technology and appropriate memory mapping. High-density MCM and flip-chip solder bump technology are used to achieve a system with an I/O bandwidth of 100 Gb/s/cm² of die. The use of DRAMs in these systems usually makes their performance poor, and some algorithms make it difficult to fully utilize the available memory bandwidth. This paper presents the design of a fast Fourier transform (FFT) engine that gives SRAM-like performance in a DRAM-based system and uses almost 100% of the available burst-mode memory bandwidth. This FFT engine can compute a million-point FFT in 1.31 ms at a sustained computation rate of 8.64 × 10^10 floating-point operations per second (FLOPS), at least an order of magnitude better than conventional systems.

Index Terms—Chip-package codesign, fast Fourier transform (FFT), seamless high off-chip connectivity (SHOCC).

I. INTRODUCTION

HIGH-PERFORMANCE DSP applications, like synthetic aperture radar (SAR), require extremely large computation rates and have large working data sets. By using a high-density interconnect technology like seamless high off-chip connectivity (SHOCC), memory bandwidth can be improved. As the demand for higher resolution DSP systems increases, the computation rate is expected to reach tera floating-point operations per second (TFLOPS) rates. These applications involve manipulations of large data volumes (1 GB or more), which makes it necessary from a cost point of view to use DRAMs in these systems. DRAMs, due to their refresh and row access cycles, would be expected to perform much worse than SRAMs in most cases.
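As a sanity check on the abstract's figures, the standard ~5N log₂N operation count for a complex radix-2 FFT, divided by the reported 1.31 ms, lands at the same order of magnitude as the quoted sustained rate. This is a back-of-the-envelope sketch; the exact figure depends on the operation-count convention, which is not stated here:

```python
import math

N = 1 << 20        # million-point FFT (2**20 points)
t = 1.31e-3        # reported computation time, in seconds

# Common operation count for a complex radix-2 FFT: ~5 * N * log2(N) FLOPs
ops = 5 * N * math.log2(N)

flops = ops / t
print(f"{ops:.3e} ops -> sustained rate ~{flops:.2e} FLOPS")
# ~8e10 FLOPS, i.e. the same order of magnitude as the reported 8.64e10
```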
Manuscript received September 25, 2003; revised March 22, 2004. This work was supported by ARDA under Contract MDA904-00-C-2133 and by the NSF under Contract EIA-9703090.

P. Mehrotra was with the Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC 27695 USA. She is now in Allentown, PA 18104 USA (e-mail: pronita@ieee.org).

V. Rao was with the Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC 27695 USA. He is now with Sun Microsystems, Mountain View, CA 94043 USA.

T. M. Conte and P. D. Franzon are with the Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC 27695 USA (e-mail: conte@eos.ncsu.edu; paulf@eos.ncsu.edu).

Digital Object Identifier 10.1109/TADVP.2005.846937

In addition, signal processing algorithms are mostly memory starved, and one expects that increasing the memory bandwidth would improve performance. Some algorithms, however, make it difficult to fully utilize the available memory bandwidth. The work reported in this paper focuses on how large DRAM-based DSP systems capable of high sustained memory performance can be built using SHOCC technology. Depending on the application, an efficient memory management scheme can lead to better memory bandwidth utilization and give almost SRAM-like performance.

SHOCC is a combined packaging, interconnect, and IC design technology aimed at providing system-level integration by using very high density solder bump and thin-film technologies [1], [2]. The purpose of this paper is to show how the use of such a technology can result in a radical performance improvement in a specific application. We show that a five-layer thin-film multichip module (MCM), together with a 140-µm flip-chip solder bump technology, can be used to achieve a peak I/O bandwidth of 100 Gb/s/cm² of die without compromising noise performance.
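A rough bump-budget calculation suggests why a 140-µm bump pitch makes a 100 Gb/s per cm² of die target plausible: even a modest per-pin data rate suffices once thousands of bumps fit under a square centimeter. The 50% signal-bump fraction below is an illustrative assumption (the remainder taken as power/ground), not a figure from the paper:

```python
pitch_um = 140.0        # flip-chip solder bump pitch, from the paper
cm_in_um = 10_000.0

bumps_per_edge = cm_in_um / pitch_um     # ~71 bumps along 1 cm
bumps_per_cm2 = bumps_per_edge ** 2      # ~5100 bumps under 1 cm^2 of die

target_bw = 100e9       # 100 Gb/s per cm^2 of die (aggregate)
signal_fraction = 0.5   # assumed: half the bumps carry signals

per_pin = target_bw / (bumps_per_cm2 * signal_fraction)
print(f"{bumps_per_cm2:.0f} bumps/cm^2 -> ~{per_pin/1e6:.0f} Mb/s per signal pin")
```

Under these assumptions each signal bump needs only a few tens of Mb/s, so the aggregate target is limited by routing density and signal integrity rather than per-pin speed, which is why the paper's analysis turns to the substrate and transmission-line behavior next.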
The key to achieving this performance is the use of a redistribution layer with a local ground. There are several applications that can benefit from such high I/O bandwidth, including networking and graphics. In this paper, we take a large fast Fourier transform (FFT) system to demonstrate the performance gains that can be achieved when this technology is combined with an efficient memory mapping scheme and better utilization of available resources. We chose an FFT system because the FFT is the most challenging and time-consuming part of many signal processing algorithms and is difficult to map onto DRAMs.

The rest of the paper is organized as follows. Section II discusses the signal-integrity-related issues which determine the maximum bandwidth that can be obtained from the SHOCC technology. Section III presents the results obtained after simulating the SHOCC transmission line. Section IV summarizes the results of the simulation in terms of the overall bandwidth available from the SHOCC technology. Section V discusses the physical architecture of the FFT system, i.e., the layout of the memory and the microaccelerators on the SHOCC substrate, the details of the microaccelerator chip, and the sequence of operations of the FFT system. Section VI describes the logical architecture of the design; the focus there is on two schemes that help improve the performance of the FFT system: 1) the memory management scheme, which extracts the maximum possible performance out of the DRAMs, and 2) the twiddle factor generation scheme, which is crucial for the successful operation of the FFT. We conclude by analyzing the performance of the FFT engine in Section VII.

II. SIGNAL INTEGRITY IN DENSELY ROUTED SUBSTRATES

This section looks at various circuit-related aspects like the substrate cross section, number of routing layers, routing

1521-3323/$20.00 © 2005 IEEE