IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 47, NO. 1, JANUARY 2012 23
A Super-Pipelined Energy Efficient Subthreshold
240 MS/s FFT Core in 65 nm CMOS
Dongsuk Jeon, Student Member, IEEE, Mingoo Seok, Student Member, IEEE, Chaitali Chakrabarti,
David Blaauw, Senior Member, IEEE, and Dennis Sylvester, Fellow, IEEE
Abstract—This paper proposes a design approach targeting
circuits operating at extremely low supply voltages, with the
goal of reducing the voltage at which energy is minimized,
thereby improving the achievable energy efficiency of the circuit.
The proposed methods accomplish this by minimizing the
circuit’s ratio of leakage to active current. The first method, super
pipelining, increases the number of pipeline stages compared
to conventional ultra low voltage (ULV) pipelining strategies,
reducing the leakage/dynamic energy ratio and simultaneously
improving performance and energy efficiency. Measurements of
super-pipelined multipliers demonstrate 30% energy savings and
a 1.6× performance improvement. Since super pipelining reduces
the logic depth between registers, two-phase latch based design is
employed to compensate for reduced averaging effects and
provide better variation tolerance. The second technique introduces a
parallel-pipelined architecture that suppresses leakage energy by
ensuring full utilization of functional units and reduces memory
size. We apply these techniques to a 16-b 1024-pt complex-valued
Fast Fourier Transform (FFT) core along with a low-power first-in
first-out (FIFO) design and a robust clock distribution network.
The FFT core is fabricated in 65 nm CMOS and consumes
15.8 nJ/FFT with a clock frequency of 30 MHz and throughput
of 240 Msamples/s at $V_{dd} = 270$ mV, providing 2.4× better
energy efficiency than current state-of-art and higher
throughput than typical ULV designs. Measurements of 60 dies
show modest frequency (energy) spreads of 7% (2%).
Index Terms—Fast Fourier Transform (FFT), subthreshold
CMOS circuits, super-pipelining, ultra low voltage (ULV) design.
Manuscript received April 22, 2011; revised June 27, 2011; accepted August
22, 2011. Date of publication November 04, 2011; date of current version
December 23, 2011. This paper was approved by Guest Editor Tanay Karnik. This
work was supported by the Multiscale Systems Center, Army Research
Laboratory, National Science Foundation, and National Institute of Standards and
Technology.
D. Jeon, D. Blaauw, and D. Sylvester are with the University of Michigan,
Ann Arbor, MI 48109-2121 USA (e-mail: djeon@umich.edu).
M. Seok is with Texas Instruments, Dallas, TX 75243 USA.
C. Chakrabarti is with Arizona State University, Tempe, AZ 85287 USA.
Digital Object Identifier 10.1109/JSSC.2011.2169311
I. INTRODUCTION
Recently, voltage scaling has been widely applied to
highly energy-constrained systems such as battery-powered
sensor nodes to minimize energy consumption. Voltage
scaling enables energy efficient computation by quadratic (or
greater) reductions of switching and leakage power dissipation.
Although voltage scaling increases gate delay and thus
degrades performance, it is still advantageous for many applications
with relaxed performance requirements [1], [2] and the
supply voltage may be scaled down to, or below, the device
threshold voltage $V_{th}$. However, leakage energy consumption
per cycle increases due to enlarged stage delay as voltage
scales and this overhead starts to exceed the switching energy
savings below the optimal operating point $V_{min}$, producing
optimal energy consumption $E_{min}$. Therefore there exists a
fundamental limit for energy savings from voltage scaling in
the subthreshold regime regardless of $V_{th}$ [3]. To enhance
energy efficiency beyond this point, leakage energy must be
suppressed by elimination of idle gates or other techniques to
boost the utilization of each gate or module in the system. Since
ultra-low voltage operation incurs high process/voltage/temperature
(PVT) variation [4], variation tolerance should also
be considered in designing these low voltage systems. Such an
energy-optimal design methodology is demonstrated on a Fast
Fourier Transform (FFT) accelerator in this work.
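
The tradeoff described above can be illustrated with a minimal numerical sketch. It is a first-order subthreshold model and not the analysis used in this work. It assumes that on-current and off-current differ by a factor of exp(Vdd / (n * vT)), so the threshold voltage cancels out of the leakage energy per cycle, and it ignores the switching and leakage overhead of the extra pipeline registers. The function names, the activity factor of 0.1, the slope factor of 1.5, and the logic depths of 40 and 10 are all hypothetical values chosen for illustration.

```python
import math

# Assumed subthreshold slope factor (1.5) times thermal voltage (26 mV), in volts.
N_VT = 1.5 * 0.026


def energy_per_cycle(vdd, logic_depth, activity=0.1):
    """Normalized energy per gate per cycle, in units of C * (1 V)^2.

    Switching: activity * Vdd^2.
    Leakage:   I_off * Vdd * T_cycle, with T_cycle = logic_depth * C * Vdd / I_on
               and I_on / I_off = exp(Vdd / N_VT), which gives
               logic_depth * Vdd^2 * exp(-Vdd / N_VT).
    """
    switching = activity * vdd ** 2
    leakage = logic_depth * vdd ** 2 * math.exp(-vdd / N_VT)
    return switching + leakage


def minimum_energy_point(logic_depth, v_lo=0.15, v_hi=1.0, step=0.001):
    """Sweep Vdd on a grid and return (V_min, E_min) for the given logic depth."""
    steps = int(round((v_hi - v_lo) / step))
    candidates = [v_lo + i * step for i in range(steps + 1)]
    v_min = min(candidates, key=lambda v: energy_per_cycle(v, logic_depth))
    return v_min, energy_per_cycle(v_min, logic_depth)


if __name__ == "__main__":
    # Hypothetical logic depths: a conventional stage versus a super-pipelined one.
    for depth in (40, 10):
        v_min, e_min = minimum_energy_point(depth)
        print(f"logic depth {depth:2d}: V_min ~ {v_min:.3f} V, "
              f"E_min ~ {e_min:.5f} (normalized)")
```

Under these assumptions, shortening the logic depth shrinks the leakage term, so both the voltage at which energy is minimized and the minimum energy itself decrease. A real design must also account for the energy of the added registers, which this sketch leaves out.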
The FFT is a key digital signal processing (DSP) algorithm
and is widely used in digital communication and sensor signal
processing. Aided by technology scaling, FFT accelerators have
become feasible, offering higher energy efficiency than general
purpose processors even for volume-constrained systems
such as sensor nodes [2], [5]. We use such an FFT core as a
demonstration vehicle for several circuit and architectural techniques
aimed at reducing $V_{min}$ and $E_{min}$, while achieving unusually
high throughput for a subthreshold circuit. Past work
on power efficient FFTs includes [6], where the authors propose
a cached-memory FFT architecture that processes intermediate
results within cached data sets to minimize the number of main
memory accesses. In [5], the authors employ voltage scaling to
improve energy efficiency. They use standard cells and memories
optimized for subthreshold operation and target their design
at the optimal energy operating point. However, the body
of prior work in this area has not investigated the key role of
leakage energy in the subthreshold regime, and we show that
energy efficiency can be improved beyond the conventional optimal
energy operating point by suppressing leakage effectively.
This paper is an extension of [7]. It describes the use of
various circuit techniques such as super-pipelining along with
an architectural study focused on extending voltage scalability
and enhancing performance in the design of a 1024-point complex-valued
FFT core. The use of super-pipelining improves
performance and reduces leakage energy, but removes averaging
effects of random process variability due to shorter logic
depth. As a result we employ two-phase latches rather than
edge-triggered registers to recapture some averaging through
time borrowing. Measured results of these techniques on a
multiplier show 30% energy savings concurrently with a 1.6×
performance improvement over a conventional unpipelined
multiplier. A parallel-pipelined FFT architecture is then
proposed to maximize computational element and memory
0018-9200/$26.00 © 2011 IEEE