IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 47, NO. 1, JANUARY 2012

A Super-Pipelined Energy Efficient Subthreshold 240 MS/s FFT Core in 65 nm CMOS

Dongsuk Jeon, Student Member, IEEE, Mingoo Seok, Student Member, IEEE, Chaitali Chakrabarti, David Blaauw, Senior Member, IEEE, and Dennis Sylvester, Fellow, IEEE

Abstract—This paper proposes a design approach targeting circuits operating at extremely low supply voltages, with the goal of reducing the voltage at which energy is minimized, thereby improving the achievable energy efficiency of the circuit. The proposed methods accomplish this by minimizing the circuit's ratio of leakage to active current. The first method, super-pipelining, increases the number of pipeline stages compared to conventional ultra low voltage (ULV) pipelining strategies, reducing the leakage/dynamic energy ratio and simultaneously improving performance and energy efficiency. Measurements of super-pipelined multipliers demonstrate 30% energy savings and 1.6× performance improvement. Since super-pipelining reduces the logic depth between registers, two-phase latch based design is employed to compensate for reduced averaging effects and provide better variation tolerance. The second technique introduces a parallel-pipelined architecture that suppresses leakage energy by ensuring full utilization of functional units and reduces memory size. We apply these techniques to a 16-b 1024-pt complex-valued Fast Fourier Transform (FFT) core, along with a low-power first-in first-out (FIFO) design and a robust clock distribution network. The FFT core is fabricated in 65 nm CMOS and consumes 15.8 nJ/FFT with a clock frequency of 30 MHz and throughput of 240 Msamples/s at , providing 2.4× better energy efficiency than the current state of the art and higher throughput than typical ULV designs. Measurements of 60 dies show modest frequency (energy) spreads of 7% (2%).
Index Terms—Fast Fourier Transform (FFT), subthreshold CMOS circuits, super-pipelining, ultra low voltage (ULV) design.

Manuscript received April 22, 2011; revised June 27, 2011; accepted August 22, 2011. Date of publication November 04, 2011; date of current version December 23, 2011. This paper was approved by Guest Editor Tanay Karnik. This work was supported by the Multiscale Systems Center, Army Research Laboratory, National Science Foundation, and National Institute of Standards and Technology.
D. Jeon, D. Blaauw, and D. Sylvester are with the University of Michigan, Ann Arbor, MI 48109-2121 USA (e-mail: djeon@umich.edu).
M. Seok is with Texas Instruments, Dallas, TX 75243 USA.
C. Chakrabarti is with Arizona State University, Tempe, AZ 85287 USA.
Digital Object Identifier 10.1109/JSSC.2011.2169311

I. INTRODUCTION

Recently, voltage scaling has been widely applied to highly energy-constrained systems such as battery-powered sensor nodes to minimize energy consumption. Voltage scaling enables energy efficient computation by quadratic (or greater) reductions of switching and leakage power dissipation. Although voltage scaling increases gate delay and thus degrades performance, it is still advantageous for many applications with relaxed performance requirements [1], [2], and the supply voltage may be scaled down to, or below, the device threshold voltage $V_{th}$. However, leakage energy consumption per cycle increases due to enlarged stage delay as voltage scales, and this overhead starts to exceed the switching energy savings below the optimal operating point $V_{min}$, producing optimal energy consumption $E_{min}$. Therefore, there exists a fundamental limit for energy savings from voltage scaling in the subthreshold regime regardless of $V_{th}$ [3]. To enhance energy efficiency beyond this point, leakage energy must be suppressed by elimination of idle gates or other techniques to boost the utilization of each gate or module in the system.
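The trade-off described above, where switching energy falls with supply voltage while leakage energy per cycle grows with the stretched cycle time, can be illustrated with a small numerical sketch. It is not the model used in this work: it assumes a textbook first-order subthreshold model in which on-current grows exponentially with supply voltage, ignores DIBL and register overhead, and uses hypothetical, normalized parameter values (`activity`, `n_gates`, `slope`, `k_delay`). Under these assumptions the threshold voltage cancels out of the leakage energy term, and the supply voltage at which energy per cycle is minimized (called V_min here) depends on logic depth and switching activity.

```python
import math

def energy_per_cycle(vdd, logic_depth, activity=0.1, n_gates=1000,
                     c_gate=1.0, slope=1.5, v_thermal=0.026, k_delay=1.0):
    """Normalized energy per cycle in a first-order subthreshold model.

    All parameter values are hypothetical and chosen only for illustration.
    """
    s = slope * v_thermal  # subthreshold slope factor times thermal voltage
    # Switching energy: alpha * C * Vdd^2 summed over all gates.
    e_dyn = activity * n_gates * c_gate * vdd ** 2
    # Leakage energy: I_leak * Vdd * T_cycle, with T_cycle proportional to
    # logic_depth * C * Vdd / I_on. The ratio I_leak / I_on reduces to
    # exp(-Vdd / s), so the threshold voltage does not appear.
    e_leak = (n_gates * logic_depth * k_delay * c_gate
              * vdd ** 2 * math.exp(-vdd / s))
    return e_dyn + e_leak

def find_vmin(logic_depth, lo=0.10, hi=0.80, steps=701):
    """Grid search (1 mV steps by default) for the energy-minimal supply."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda v: energy_per_cycle(v, logic_depth))

if __name__ == "__main__":
    for depth in (40, 10):
        v = find_vmin(depth)
        print(f"depth={depth:2d}  V_min={v:.3f} V  "
              f"E_min={energy_per_cycle(v, depth):.2f} (normalized)")
```

In this sketch, reducing the logic depth per stage shortens the cycle time, shrinks the leakage term, and moves both the energy-minimal voltage and the minimum energy lower. The model is only meaningful in the subthreshold region and does not capture functional failure at very low supply voltages.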
Since ultra-low voltage operation incurs high process/voltage/temperature (PVT) variation [4], variation tolerance should also be considered in designing these low voltage systems.

Such an energy-optimal design methodology is demonstrated on a Fast Fourier Transform (FFT) accelerator in this work. The FFT is a key digital signal processing (DSP) algorithm and is widely used in digital communication and sensor signal processing. Aided by technology scaling, FFT accelerators have become feasible, offering higher energy efficiency than general purpose processors even for volume-constrained systems such as sensor nodes [2], [5]. We use such an FFT core as a demonstration vehicle for several circuit and architectural techniques aimed at reducing $V_{min}$ and $E_{min}$, while achieving unusually high throughput for a subthreshold circuit. Past work on power-efficient FFTs includes [6], where the authors propose a cached-memory FFT architecture that processes intermediate results within cached data sets to minimize the number of main memory accesses. In [5], the authors employ voltage scaling to improve energy efficiency. They use standard cells and memories optimized for subthreshold operation and target their design at the optimal energy operating point. However, the body of prior work in this area has not investigated the key role of leakage energy in the subthreshold regime, and we show that energy efficiency can be improved beyond the conventional optimal energy operating point by suppressing leakage effectively.

This paper is an extension of [7]. It describes the use of various circuit techniques such as super-pipelining, along with an architectural study focused on extending voltage scalability and enhancing performance, in the design of a 1024-point complex-valued FFT core. The use of super-pipelining improves performance and reduces leakage energy, but removes averaging effects of random process variability due to shorter logic depth.
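The loss of averaging with shorter logic depth can be illustrated with a small Monte Carlo sketch. It is not taken from this work: it assumes that gate delays are independent and lognormally distributed (a common first-order assumption for subthreshold circuits, where delay depends exponentially on threshold voltage), and `sigma_gate` is a hypothetical value. The function returns the relative spread (standard deviation over mean) of the total delay of a path with a given number of gates.

```python
import random
import statistics

def path_delay_spread(logic_depth, sigma_gate=0.3, trials=5000, seed=0):
    """Relative spread (sigma/mean) of the delay of a path of `logic_depth` gates.

    Each gate delay is drawn independently from a lognormal distribution;
    sigma_gate is a hypothetical per-gate variability, not a measured value.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    delays = [
        sum(rng.lognormvariate(0.0, sigma_gate) for _ in range(logic_depth))
        for _ in range(trials)
    ]
    return statistics.stdev(delays) / statistics.mean(delays)

if __name__ == "__main__":
    for depth in (40, 20, 10, 5):
        print(f"logic depth {depth:2d}: sigma/mean = {path_delay_spread(depth):.3f}")
```

Under the independence assumption, the relative spread scales roughly as one over the square root of the logic depth, so a stage with a quarter of the gates sees about twice the relative delay variation.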
As a result, we employ two-phase latches rather than edge-triggered registers to recapture some averaging through time borrowing. Measured results of these techniques on a multiplier show 30% energy savings concurrently with a 1.6× performance improvement over a conventional unpipelined multiplier. A parallel-pipelined FFT architecture is then proposed to maximize computational element and memory