420 IEEE JOURNAL OF SOLID STATE CIRCUITS, VOL. 37, NO. 3, MARCH 2002
A Design Environment for High-Throughput
Low-Power Dedicated Signal Processing Systems
W. Rhett Davis, Member, IEEE, Ning Zhang, Student Member, IEEE, Kevin Camera, Student Member, IEEE,
Dejan Marković, Student Member, IEEE, Tina Smilkstein, Student Member, IEEE, M. Josie Ammer,
Engling Yeo, Student Member, IEEE, Stephanie Augsburger, Student Member, IEEE, Borivoje Nikolić, Member, IEEE,
and Robert W. Brodersen, Fellow, IEEE
Abstract—A hierarchical automated design flow for low-energy
direct-mapped signal processing integrated circuits is presented.
A modular framework based on a combined dataflow graph and
floorplan description drives automatic layout generation with com-
mercial CAD tools. Automatic characterization of layout improves
system-level estimates. Simplified physical design methodologies
for low supply voltages are discussed. The flow is demonstrated on
a 300-k transistor test-chip, a time-division multiple-access base-
band receiver, and a soft-output Viterbi decoder. An example of
architectural comparison of energy efficiency is presented.
Index Terms—Application specific integrated circuits, design au-
tomation, design methodology, integrated circuit design, parallel
architectures, system analysis and design.
I. INTRODUCTION
The architectures commonly used to implement signal-processing
algorithms in hardware differ most significantly
in terms of efficiency and flexibility. General-purpose processors
are the least energy- and area-efficient, while slightly more
specialized architectures, such as programmable digital signal
processors, can often accomplish the same task with an order
of magnitude less energy. The most efficient architectures in
terms of power and area can be obtained by directly mapping the
algorithms into hardware. Computational energy and area effi-
ciencies that can be achieved with this approach are 100–1000
MOPS/mW and 100–1000 MOPS/mm². These efficiencies can
be two to three orders of magnitude higher than the efficiency
achieved by software processors [1].
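To put the quoted energy efficiencies in more familiar terms: 1 MOPS/mW is 10^6 operations per second per 10^-3 W, that is, 10^9 operations per joule, so the energy per operation in picojoules is 1000 divided by the MOPS/mW figure. The short sketch below only performs this unit conversion; the function name is illustrative and does not come from the paper.

```python
def energy_per_op_pj(mops_per_mw: float) -> float:
    """Convert an energy efficiency in MOPS/mW to energy per operation in pJ.

    1 MOPS/mW = 1e6 op/s / 1e-3 W = 1e9 op/J,
    so E = 1 / (x * 1e9) J = 1000 / x pJ for an efficiency of x MOPS/mW.
    """
    if mops_per_mw <= 0:
        raise ValueError("efficiency must be positive")
    return 1000.0 / mops_per_mw


# The 100-1000 MOPS/mW range quoted for direct-mapped hardware.
for eff in (100.0, 1000.0):
    print(f"{eff:g} MOPS/mW -> {energy_per_op_pj(eff):g} pJ per operation")
```

Under this conversion, the quoted range corresponds to roughly 10 pJ down to 1 pJ per operation.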
A direct-mapped architecture can be obtained by mapping
the operations of a dataflow graph directly into functional units
and hard-wiring the connections between them. In this way, the
maximum parallelism can be obtained, allowing the minimum
clock rate and supply voltage to be used, resulting in reduced energy
per operation [2]. The ability to exploit a high level of parallelism
allows computational rates that far exceed those of uniprocessors without
requiring high clock rates. For example, a direct-mapped implementation
of the three-tap finite-impulse response (FIR) filter graph shown in
Fig. 1(a) would contain a delay line, three multipliers, and two adders
as shown in Fig. 1(b). In contrast, a resource-shared architecture such
as the one shown in Fig. 1(c) alters the dataflow graph in order to
reduce the design to a single multiplier and adder. The energy required
for the computation can be modeled with the equation
(1)
Fig. 1. A simple data-flow graph for: (a) a three-tap FIR filter, (b) a
direct-mapped implementation, and (c) a resource-shared implementation.
Manuscript received July 24, 2001; revised October 22, 2001. This work was
supported by DARPA and the member companies of the Berkeley Wireless
Research Center.
W. R. Davis is with the Berkeley Wireless Research Center, Berkeley, CA
94704 USA (e-mail: wrdavis@eecs.berkeley.edu).
N. Zhang is with Atheros Communications, Inc., Sunnyvale, CA 94085 USA.
K. Camera is with Atheros Communications, Inc., Sunnyvale, CA 94085
USA. He is also with the Department of Electrical Engineering and Computer
Science, University of California, Berkeley, CA 94704 USA.
D. Marković, T. Smilkstein, M. J. Ammer, E. Yeo, S. Augsburger, B. Nikolić,
and R. W. Brodersen are with the Department of Electrical Engineering and
Computer Science, University of California, Berkeley, CA 94704 USA.
Publisher Item Identifier S 0018-9200(02)01695-5.
0018–9200/02$17.00 © 2002 IEEE
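The contrast between the direct-mapped and resource-shared FIR structures of Fig. 1 can be illustrated with a small behavioral model. This is a sketch under simplifying assumptions, not the authors' design flow: the function names, the sample values, and the cycle-count model (one output per cycle for the direct-mapped datapath, one multiply-accumulate per cycle for the shared datapath) are all illustrative.

```python
from collections import deque
from typing import List, Sequence, Tuple


def fir_direct_mapped(samples: Sequence[int], taps: Sequence[int]) -> Tuple[List[int], int]:
    """One multiplier per tap working in parallel; one output per cycle."""
    delay = deque([0] * len(taps), maxlen=len(taps))  # models the delay line
    outputs, cycles = [], 0
    for x in samples:
        delay.appendleft(x)  # shift the new sample in; the oldest falls off
        products = [c * d for c, d in zip(taps, delay)]  # parallel multipliers
        outputs.append(sum(products))  # adder chain (two adders for three taps)
        cycles += 1
    return outputs, cycles


def fir_resource_shared(samples: Sequence[int], taps: Sequence[int]) -> Tuple[List[int], int]:
    """A single multiplier and adder reused once per tap for every sample."""
    delay = deque([0] * len(taps), maxlen=len(taps))
    outputs, cycles = [], 0
    for x in samples:
        delay.appendleft(x)
        acc = 0
        for c, d in zip(taps, delay):
            acc += c * d  # the one shared multiply-accumulate unit
            cycles += 1   # assumed: one multiply-accumulate per clock cycle
        outputs.append(acc)
    return outputs, cycles


samples = [1, 2, 3, 4]   # illustrative input
taps = [1, 2, 3]         # illustrative three-tap coefficients
y_direct, cycles_direct = fir_direct_mapped(samples, taps)
y_shared, cycles_shared = fir_resource_shared(samples, taps)
print(y_direct, cycles_direct)
print(y_shared, cycles_shared)
```

Both functions produce the same output sequence, but in this model the shared datapath needs three cycles per sample where the direct-mapped one needs a single cycle. For equal throughput the shared design would therefore have to be clocked three times faster, which is the trade-off behind the text's remark that parallelism permits a lower clock rate and supply voltage.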