288 IEEE TRANSACTIONS ON ADVANCED PACKAGING, VOL. 28, NO. 2, MAY 2005
Optimal Chip-Package Codesign for
High-Performance DSP
Pronita Mehrotra, Member, IEEE, Vikram Rao, Thomas M. Conte, Senior Member, IEEE, and
Paul D. Franzon, Senior Member, IEEE
Abstract—In high-performance DSP systems, memory
bandwidth can be improved using high-density interconnect
technology and appropriate memory mapping. High-density
MCM and flip-chip solder bump technology are used to achieve
a system with an I/O bandwidth of 100 Gb/s/cm die. The
DRAMs used in these systems usually degrade performance,
and some algorithms make it difficult to fully
utilize the available memory bandwidth. This paper presents
the design of a fast Fourier transform (FFT) engine that gives
SRAM-like performance in a DRAM-based system. It uses almost
100% of the available burst-mode memory bandwidth. This FFT
engine can compute a million-point FFT in 1.31 ms at a sustained
computation rate of 8.64 × 10¹⁰ floating-point operations per
second (FLOPS), at least an order of magnitude better
than conventional systems.
Index Terms—Chip-package codesign, fast Fourier transform
(FFT), seamless high off-chip connectivity (SHOCC).
I. INTRODUCTION
HIGH-PERFORMANCE DSP applications, like synthetic
aperture radar (SAR), require extremely large computa-
tion rates and have large working data sets. By using a high-den-
sity interconnect technology like seamless high off-chip con-
nectivity (SHOCC), memory bandwidth can be improved. As
the demand for higher resolution DSP systems increases, the
computation rate is expected to reach tera floating-point operations
per second (TFLOPS). These applications involve
manipulations of large data volumes (1 GB or more), which
makes it necessary from a cost point of view to use DRAMs in
these systems. DRAMs, due to their refresh and row access cy-
cles, would be expected to perform much worse than SRAMs in
most cases. In addition, signal processing algorithms are mostly
memory starved, and one expects that increasing the memory
bandwidth would improve performance. Some algorithms make
it difficult to fully utilize the available memory bandwidth. The
work reported in this paper focuses on how large DRAM-based
DSP systems that are capable of high sustained memory
performance can be built using SHOCC technology. Depending on
the application, an efficient memory management scheme can
lead to better memory bandwidth utilization and give almost
SRAM-like performance.

Manuscript received September 25, 2003; revised March 22, 2004. This work
was supported by ARDA under Contract MDA904-00-C-2133 and by the NSF
under Contract EIA-9703090.

P. Mehrotra was with the Department of Electrical and Computer Engineering,
North Carolina State University, Raleigh, NC 27695 USA. She is now
in Allentown, PA 18104 USA (e-mail: pronita@ieee.org).

V. Rao was with the Department of Electrical and Computer Engineering,
North Carolina State University, Raleigh, NC 27695 USA. He is now with Sun
Microsystems, Mountain View, CA 94043 USA.

T. M. Conte and P. D. Franzon are with the Department of Electrical and
Computer Engineering, North Carolina State University, Raleigh, NC 27695 USA
(e-mail: conte@eos.ncsu.edu; paulf@eos.ncsu.edu).

Digital Object Identifier 10.1109/TADVP.2005.846937
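As a back-of-envelope check on the sustained rate quoted in the abstract, the
common 5 N log₂ N real-operation count for a radix-2 complex FFT (an assumed
convention; the engine's exact count depends on its radix and implementation)
gives the following for a million-point transform:

```python
import math

N = 2 ** 20            # million-point FFT
t = 1.31e-3            # reported computation time, in seconds

# Assumed convention: a radix-2 complex FFT costs ~5*N*log2(N) real FLOPs.
flops = 5 * N * math.log2(N)
sustained = flops / t  # sustained floating-point rate

print(f"{flops:.3e} FLOPs in {t * 1e3:.2f} ms -> {sustained:.2e} FLOPS")
```

This lands at roughly 8 × 10¹⁰ FLOPS, the same order as the rate reported in
the abstract; any residual difference traces to the operation-count convention
assumed here.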
SHOCC is a combined packaging, interconnect, and IC de-
sign technology aimed at providing system-level integration by
using very high density solder bump and thin-film technologies
[1], [2]. The purpose of this paper is to show how the use of such
a technology can result in a radical performance improvement in
a specific application. We show that a five-layer thin-film multichip
module (MCM), together with a 140-µm flip-chip solder
bump technology can be used to achieve a peak I/O bandwidth
of 100 Gb/s/cm die without compromising noise performance.
The key to achieving this performance is the use of a redistri-
bution layer with a local ground. There are several applications
that can benefit from such high I/O bandwidth, including
networking and graphics.
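As a rough sanity check on the quoted figure, an area-array bump count can be
converted into aggregate bandwidth. The pitch below is from the paper; the
signal-bump fraction and per-pin rate are illustrative assumptions only (the
actual signal partitioning and signaling rates are developed in Sections
II–IV), and the quoted 100 Gb/s/cm figure is read here as per square
centimeter of die:

```python
# Rough I/O bandwidth estimate for an area-array flip-chip die.
pitch_um = 140.0                           # flip-chip bump pitch (from the paper)
bumps_per_cm2 = (10_000 / pitch_um) ** 2   # ~5100 bumps on a 1 cm x 1 cm die

signal_fraction = 0.5          # assumed: remaining bumps are power/ground
rate_per_pin_mbps = 40.0       # assumed per-pin signaling rate

bandwidth_gbps = bumps_per_cm2 * signal_fraction * rate_per_pin_mbps / 1e3
print(f"{bumps_per_cm2:.0f} bumps/cm^2 -> ~{bandwidth_gbps:.0f} Gb/s per cm^2 of die")
```

With these placeholder values the estimate comes out near the 100 Gb/s figure,
showing that the bandwidth claim is consistent with the bump density alone.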
In this paper, we use a large fast Fourier transform (FFT)
system to demonstrate the performance gains that can be
achieved when this interconnect technology is combined with an
efficient memory mapping scheme and better utilization of
available resources. We chose
an FFT system to demonstrate this, since the FFT is the most
challenging and time-consuming part of many signal processing
algorithms, and is difficult to map onto DRAMs.
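To see why the FFT is hard to map onto DRAMs, consider the access pattern of a
textbook iterative radix-2 FFT (a generic sketch, not the engine described in
this paper): each stage pairs elements a power-of-two stride apart, so the
later stages touch addresses that fall in different DRAM rows, defeating
burst-mode locality unless the data are remapped.

```python
import cmath

def fft_strides(n):
    """Butterfly stride used by each stage of an iterative radix-2
    FFT on n points (n a power of two)."""
    strides, span = [], 1
    while span < n:
        strides.append(span)   # butterfly partners sit `span` elements apart
        span *= 2
    return strides

def fft(x):
    """Textbook iterative radix-2 decimation-in-time FFT (generic sketch)."""
    n = len(x)
    # Bit-reversal permutation of the input indices.
    bits = n.bit_length() - 1
    x = [x[int(format(i, f'0{bits}b')[::-1], 2)] for i in range(n)]
    span = 1
    while span < n:
        w_step = cmath.exp(-1j * cmath.pi / span)
        for start in range(0, n, 2 * span):
            w = 1.0
            for k in range(start, start + span):
                # The two operands are `span` apart: a growing stride.
                a, b = x[k], x[k + span] * w
                x[k], x[k + span] = a + b, a - b
                w *= w_step
        span *= 2
    return x

print(fft_strides(1 << 20)[-3:])   # last stages stride by 2^17..2^19 elements
```

For a million-point transform the final stages stride by hundreds of thousands
of elements, far beyond any single DRAM row, which is exactly the behavior the
memory mapping scheme of Section VI is designed to tame.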
The rest of the paper is organized as follows. Section II
discusses the signal-integrity issues that determine the
maximum bandwidth that can be obtained from the SHOCC
technology. Section III presents the results obtained after sim-
ulating the SHOCC transmission line. Section IV summarizes
the results of the simulation in terms of the overall bandwidth
available from the SHOCC technology. Section V discusses the
physical architecture of the FFT system, i.e., the layout of the
memory and the microaccelerators on the SHOCC substrate, the
details of the microaccelerator chip and the sequence of opera-
tions of the FFT system. Section VI describes the logical archi-
tecture of the design. The focus in this section is on two schemes
that help improve the performance of the FFT system: 1) the
memory management scheme, which extracts the maximum
possible performance from the DRAMs, and 2) the twiddle
factor generation scheme, which is crucial to the successful
operation of the FFT. We conclude by analyzing the performance
of the FFT engine in Section VII.
II. SIGNAL INTEGRITY IN DENSELY ROUTED SUBSTRATES
This section examines various circuit-related aspects such as
the substrate cross section, number of routing layers, routing