420 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 37, NO. 3, MARCH 2002

A Design Environment for High-Throughput Low-Power Dedicated Signal Processing Systems

W. Rhett Davis, Member, IEEE, Ning Zhang, Student Member, IEEE, Kevin Camera, Student Member, IEEE, Dejan Marković, Student Member, IEEE, Tina Smilkstein, Student Member, IEEE, M. Josie Ammer, Engling Yeo, Student Member, IEEE, Stephanie Augsburger, Student Member, IEEE, Borivoje Nikolić, Member, IEEE, and Robert W. Brodersen, Fellow, IEEE

Abstract—A hierarchical automated design flow for low-energy direct-mapped signal processing integrated circuits is presented. A modular framework based on a combined dataflow graph and floorplan description drives automatic layout generation with commercial CAD tools. Automatic characterization of layout improves system-level estimates. Simplified physical design methodologies for low supply voltages are discussed. The flow is demonstrated on a 300-k transistor test-chip, a time-division multiple-access baseband receiver, and a soft-output Viterbi decoder. An example of architectural comparison of energy efficiency is presented.

Index Terms—Application specific integrated circuits, design automation, design methodology, integrated circuit design, parallel architectures, system analysis and design.

I. INTRODUCTION

THE architectures commonly used to implement signal-processing algorithms in hardware differ most significantly in terms of efficiency and flexibility. General purpose processors are the least energy- and area-efficient, while slightly more specialized architectures, such as programmable digital signal processors, can often accomplish the same task with an order of magnitude less energy. The most efficient architectures in terms of power and area can be obtained by directly mapping the algorithms into hardware. Computational energy and area efficiencies that can be achieved with this approach are 100–1000 MOPS/mW and 100–1000 MOPS/mm².
These efficiencies can be two to three orders of magnitude higher than the efficiency achieved by software processors [1].

A direct-mapped architecture can be obtained by mapping the operations of a dataflow graph directly into functional units and hard-wiring the connections between them. In this way, the maximum parallelism can be obtained, allowing the minimum clock rate and supply voltage to be used, resulting in reduced energy per operation [2]. The ability to exploit a high level of parallelism allows computational rates that far exceed those of uniprocessors without requiring high clock rates. For example, a direct-mapped implementation of the three-tap finite-impulse response (FIR) filter graph shown in Fig. 1(a) would contain a delay line, three multipliers, and two adders as shown in Fig. 1(b). In contrast, a resource-shared architecture such as the one shown in Fig. 1(c) alters the dataflow graph in order to reduce the design to a single multiplier and adder.

Fig. 1. A simple dataflow graph for: (a) a three-tap FIR filter, (b) a direct-mapped implementation, and (c) a resource-shared implementation.

Manuscript received July 24, 2001; revised October 22, 2001. This work was supported by DARPA and the member companies of the Berkeley Wireless Research Center.
W. R. Davis is with the Berkeley Wireless Research Center, Berkeley, CA 94704 USA (e-mail: wrdavis@eecs.berkeley.edu).
N. Zhang is with Atheros Communications, Inc., Sunnyvale, CA 94085 USA.
K. Camera is with Atheros Communications, Inc., Sunnyvale, CA 94085 USA. He is also with the Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94704 USA.
D. Marković, T. Smilkstein, M. J. Ammer, E. Yeo, S. Augsburger, B. Nikolić, and R. W. Brodersen are with the Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94704 USA.
Publisher Item Identifier S 0018-9200(02)01695-5.
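The contrast between the direct-mapped and resource-shared structures of Fig. 1 can be illustrated with a small behavioral sketch. It is not taken from the design flow described here: the function names, the coefficient values, and the one-operation-per-cycle counting model are assumptions made only for illustration. Both versions compute the same three-tap FIR output, but the direct-mapped version is counted as one clock cycle per sample (all multipliers and adders working in parallel), while the resource-shared version reuses a single multiply-accumulate unit and is counted as one cycle per tap.

```python
from collections import deque


def fir_direct_mapped(samples, coeffs):
    """Model of Fig. 1(b): one multiplier per tap, one output per clock cycle."""
    # Delay line; index 0 holds the newest sample x[n].
    delay_line = deque([0] * len(coeffs), maxlen=len(coeffs))
    outputs, cycles = [], 0
    for x in samples:
        delay_line.appendleft(x)  # oldest sample falls off the far end
        # All tap products and the adder chain are assumed to finish in one cycle.
        outputs.append(sum(c * d for c, d in zip(coeffs, delay_line)))
        cycles += 1
    return outputs, cycles


def fir_resource_shared(samples, coeffs):
    """Model of Fig. 1(c): a single multiplier and adder reused for every tap."""
    delay_line = deque([0] * len(coeffs), maxlen=len(coeffs))
    outputs, cycles = [], 0
    for x in samples:
        delay_line.appendleft(x)
        acc = 0
        for c, d in zip(coeffs, delay_line):
            acc += c * d  # one multiply-accumulate per clock cycle (assumed)
            cycles += 1
        outputs.append(acc)
    return outputs, cycles


if __name__ == "__main__":
    samples = [1, 2, 3, 4]   # illustrative input values
    coeffs = [1, 2, 3]       # illustrative tap weights
    y_dm, cycles_dm = fir_direct_mapped(samples, coeffs)
    y_rs, cycles_rs = fir_resource_shared(samples, coeffs)
    print(y_dm, cycles_dm)
    print(y_rs, cycles_rs)
```

Under this counting model the resource-shared version needs three times as many clock cycles for the same sample rate, which is the sense in which direct mapping permits a lower clock rate and, in turn, a lower supply voltage.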
The energy required for the computation can be modeled with the equation (1)

0018–9200/02$17.00 © 2002 IEEE