IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 4, APRIL 2006 759
A Fully Pipelined Single-Precision Floating-Point
Unit in the Synergistic Processor Element
of a CELL Processor
Hwa-Joon Oh, Member, IEEE, Silvia M. Mueller, Christian Jacobi, Kevin D. Tran, Scott R. Cottier, Member, IEEE,
Brad W. Michael, Hiroo Nishikawa, Yonetaro Totsuka, Tatsuya Namatame, Naoka Yano, Takashi Machida, and
Sang H. Dhong, Fellow, IEEE
Abstract—The floating-point unit (FPU) in the synergistic
processor element (SPE) of a CELL processor is a fully pipelined
4-way single-instruction multiple-data (SIMD) unit designed to
accelerate media and data streaming with 128-bit operands. It
supports 32-bit single-precision floating-point and 16-bit integer
operands with two different latencies, six-cycle and seven-cycle,
with 11 FO4 delay per stage. The FPU optimizes the performance
of critical single-precision multiply–add operations. Since exact
rounding, exceptions, and denormalized-number handling are not
important to multimedia applications, IEEE correctness on
single-precision floating-point numbers is sacrificed for performance
and design simplicity. It employs fine-grained clock gating
for power saving. The design has 768K transistors in 1.3 mm², fabricated
in 90-nm SOI technology. Correct operation has been observed up to
5.6 GHz at 1.4 V and 56 °C, delivering 44.8 GFlops. Architecture, logic,
circuits, and integration are codesigned to meet the performance, power,
and area goals.
Index Terms—Floating-point arithmetic, integrated circuit de-
sign, microprocessors, very large-scale integration.
I. INTRODUCTION
THE synergistic processing element (SPE) [1] of a CELL
processor [2] is the first implementation [3] of a new archi-
tecture designed to accelerate multimedia applications, such as
three-dimensional (3-D) graphics, media streaming, and signal
processing. Real-time multimedia applications demand single-
precision performance significantly exceeding that of conven-
tional processors. An SPE contains a set of 128 registers that
are 128 bits wide. These registers are used by both the floating-point
unit for single- and double-precision arithmetic and the fixed-point
unit for 32-bit integer arithmetic and logical operations [1].
Manuscript received September 5, 2005; revised December 19, 2005.
H.-J. Oh, K. D. Tran, S. R. Cottier, B. W. Michael, and S. H. Dhong are with
the IBM System and Technology Group, Austin, TX 78758 USA.
S. M. Mueller and C. Jacobi are with IBM Entwicklung GmbH, Boeblingen
71032, Germany.
H. Nishikawa is with IBM Global Services, Yasu 520-2392, Japan.
Y. Totsuka is with the Digital Imaging Group, Sony Corporation, Tokyo
108-6201, Japan.
T. Namatame, N. Yano, and T. Machida are with the Semiconductor Company,
Toshiba Corporation, Kawasaki 212-8520, Japan.
Digital Object Identifier 10.1109/JSSC.2006.870924
The single-precision floating-point unit (FPU) of the SPE
is a 4-way single-instruction multiple-data (SIMD) design.
Vector (SIMD) computing has been used in supercomputers,
and modern microprocessors provide SIMD media extensions such as
SSE, MMX, and VMX/AltiVec. Most
instructions at the FPU in the SPE process 128-bit operands,
divided into four 32-bit word slices. Each of the four slices supports
32-bit single-precision and 16-bit integer multiply–add
instructions and convert instructions between single-precision
floating point and integer. The single-precision floating-point
multiply–add instruction consumes three register operands
and produces a register result. Operands are fetched from the
register file (RF) to the operand latches of the FPU. Floating-point
results are either bypassed directly from the result
multiplexer of the FPU to the input operand latches of the FPU
to reduce the result latency, or sent to
the forward unit (FW), from where they are distributed to other
units (i.e., fixed-point unit, register file, or local-store unit).
Fig. 1 shows a simplified FPU pipeline structure.
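The slice organization described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the names `pack128`, `unpack128`, `simd_fma`, and `peak_gflops` are hypothetical, the lanes are rounded to the nearest single-precision value by Python's `struct` module (which need not match the FPU's own rounding), and a multiply–add is counted as two floating-point operations per slice, which is consistent with the 44.8 GFlops at 5.6 GHz quoted in the abstract.

```python
import struct

LANES = 4  # four 32-bit word slices per 128-bit operand


def pack128(lanes):
    """Pack four floats into a 16-byte (128-bit) register image.

    struct rounds each value to single precision (an assumption of this
    sketch, not a statement about the hardware rounding mode).
    """
    assert len(lanes) == LANES
    return struct.pack("<4f", *lanes)


def unpack128(reg):
    """Split a 128-bit register image back into its four 32-bit slices."""
    return list(struct.unpack("<4f", reg))


def simd_fma(ra, rb, rc):
    """Multiply-add on three 128-bit operands: each slice computes a*b + c."""
    a, b, c = unpack128(ra), unpack128(rb), unpack128(rc)
    return pack128([x * y + z for x, y, z in zip(a, b, c)])


def peak_gflops(clock_ghz, lanes=LANES, flops_per_fma=2):
    """Peak rate if one 4-way multiply-add issues every cycle."""
    return clock_ghz * lanes * flops_per_fma


ra = pack128([1.0, 2.0, 3.0, 4.0])
rb = pack128([0.5, 0.5, 0.5, 0.5])
rc = pack128([1.0, 1.0, 1.0, 1.0])
result = unpack128(simd_fma(ra, rb, rc))
print(result)
print(peak_gflops(5.6))
```

The model consumes three register images and produces one, mirroring the three-operand, one-result form of the multiply–add instruction; it says nothing about pipelining, bypassing, or latency.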
II. DESIGN CHALLENGES
A. 11 FO4 Design
Recent studies [4] show that the pipeline depth for a
performance-optimized design is in the range of 6–8 fanout-of-four
(FO4) inverter delays per stage, whereas for a power- and
performance-optimized design, study [5] suggests an optimum
of about 18 FO4 per stage (consisting of a logic delay of 15 FO4
and a 3 FO4 latch insertion delay). The first implementation of the
CELL processor uses a design point of 11 FO4 per stage, since
performance and power are important. The latency of the FPU
is six cycles for single-precision instructions in order to meet
the performance requirements of the target workloads. Since result
forwarding takes 6 FO4, the whole delay budget including
latch insertion delays is 60 FO4 for the single-precision logic.
A state-of-the-art FPU has a latency of around 100 FO4 [6],
since single-precision data are usually handled in the double-precision
unit. Designing a dedicated six-cycle single-precision
floating-point unit with an 11 FO4 cycle time required optimizations
at all design levels: architecture, logic, circuits, layout, and
placement.
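The 60 FO4 figure follows from simple arithmetic on the numbers given in this subsection. The sketch below assumes that the budget is the six-cycle latency multiplied by the 11 FO4 cycle time, less the 6 FO4 spent on result forwarding; the function and parameter names are illustrative only.

```python
def logic_budget_fo4(cycles, fo4_per_cycle, forwarding_fo4):
    """FO4 available for logic plus latch insertion after result forwarding."""
    return cycles * fo4_per_cycle - forwarding_fo4


# Values taken from the text: six cycles, 11 FO4 per stage, 6 FO4 forwarding.
budget = logic_budget_fo4(cycles=6, fo4_per_cycle=11, forwarding_fo4=6)
print(budget)  # 60

# Relative to the roughly 100 FO4 latency quoted for a state-of-the-art FPU.
print(budget / 100)  # 0.6
```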
B. Latch Insertion Delays
CMOS static gates are used to implement most of the logic;
dynamic circuits are used in certain timing-critical areas. A latch
insertion delay of 2 FO4 to 3 FO4 occupies a significant portion
of the 11 FO4 cycle. In order to minimize this delay, a special
latch selection was provided (Table I). There are three types of
latches: type-C, type-D, and type-E. All latches have multip
0018-9200/$20.00 © 2006 IEEE