IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 4, APRIL 2006, p. 759

A Fully Pipelined Single-Precision Floating-Point Unit in the Synergistic Processor Element of a CELL Processor

Hwa-Joon Oh, Member, IEEE, Silvia M. Mueller, Christian Jacobi, Kevin D. Tran, Scott R. Cottier, Member, IEEE, Brad W. Michael, Hiroo Nishikawa, Yonetaro Totsuka, Tatsuya Namatame, Naoka Yano, Takashi Machida, and Sang H. Dhong, Fellow, IEEE

Abstract—The floating-point unit (FPU) in the synergistic processor element (SPE) of a CELL processor is a fully pipelined 4-way single-instruction multiple-data (SIMD) unit designed to accelerate media and data streaming with 128-bit operands. It supports 32-bit single-precision floating-point and 16-bit integer operands with two different latencies, six-cycle and seven-cycle, with 11 FO4 delay per stage. The FPU optimizes the performance of critical single-precision multiply–add operations. Since exact rounding, exceptions, and denormal number handling are not important to multimedia applications, IEEE correctness on the single-precision floating-point numbers is sacrificed for performance and simple design. It employs fine-grained clock gating for power saving. The design has 768K transistors in 1.3 mm², fabricated in 90-nm SOI technology. Correct operations have been observed up to 5.6 GHz with 1.4 V and 56 °C, delivering 44.8 GFlops. Architecture, logic, circuits, and integration are codesigned to meet the performance, power, and area goals.

Index Terms—Floating-point arithmetic, integrated circuit design, microprocessors, very large-scale integration.

I. INTRODUCTION

THE synergistic processing element (SPE) [1] of a CELL processor [2] is the first implementation [3] of a new architecture designed to accelerate multimedia applications, such as three-dimensional (3-D) graphics, media streaming, and signal processing. Real-time multimedia applications demand single-precision performance significantly exceeding that of conventional processors.
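As a quick check of the peak-throughput figure quoted in the abstract, the following minimal sketch multiplies clock rate, SIMD width, and floating-point operations per multiply–add. It assumes that one 4-way multiply–add issues every cycle and that each multiply–add counts as two floating-point operations; the function and parameter names are illustrative and do not come from the paper.

```python
def peak_gflops(clock_ghz, simd_lanes=4, flops_per_lane=2):
    """Return peak GFlops, assuming one SIMD multiply-add issues per cycle.

    flops_per_lane=2 is an assumption: a multiply-add is counted as
    one multiply plus one add in each lane.
    """
    return clock_ghz * simd_lanes * flops_per_lane


if __name__ == "__main__":
    # 5.6 GHz is the highest operating frequency reported in the abstract.
    print(f"{peak_gflops(5.6):.1f} GFlops")
```

Under these assumptions, 5.6 GHz across four lanes gives 44.8 GFlops, which agrees with the number stated in the abstract.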
An SPE contains a set of 128 registers that are 128 bits wide. These registers are used by both the floating-point unit, for single- and double-precision arithmetic, and the fixed-point unit, for 32-bit integer arithmetic and logical operations [1]. The single-precision floating-point unit (FPU) of the SPE is a 4-way single-instruction multiple-data (SIMD) design. Vector computing (or SIMD) has been used in supercomputers, and modern microprocessors have media extensions such as SSE, MMX, and VMX/AltiVec for an SIMD design. Most instructions at the FPU in the SPE process 128-bit operands, divided into four 32-bit word slices. Each of the four slices supports 32-bit single-precision and 16-bit integer multiply–add instructions and convert instructions between single-precision floating point and integer. The single-precision floating-point multiply–add instruction consumes three register operands and produces a register result.

Operands are fetched from the register file (RF) to the operand latches of the FPU. Floating-point results are either bypassed directly from the result multiplexer of the FPU to the input operand latches of the FPU to reduce the result latency, or sent to the forward unit (FW), from where they are distributed to other units (i.e., fixed-point unit, register file, or local-store unit). Fig. 1 shows a simplified FPU pipeline structure.

Manuscript received September 5, 2005; revised December 19, 2005.
H.-J. Oh, K. D. Tran, S. R. Cottier, B. W. Michael, and S. H. Dhong are with the IBM System and Technology Group, Austin, TX 78758 USA.
S. M. Mueller and C. Jacobi are with IBM Entwicklung GmbH, Boeblingen 71032, Germany.
H. Nishikawa is with IBM Global Services, Yasu 520-2392, Japan.
Y. Totsuka is with the Digital Imaging Group, Sony Corporation, Tokyo 108-6201, Japan.
T. Namatame, N. Yano, and T. Machida are with the Semiconductor Company, Toshiba Corporation, Kawasaki 212-8520, Japan.
Digital Object Identifier 10.1109/JSSC.2006.870924
0018-9200/$20.00 © 2006 IEEE

II. DESIGN CHALLENGES

A. 11 FO4 Design

Recent studies [4] show that the pipeline depth for a performance-optimized design is in the range of 6–8 fanout-of-4 (FO4) inverter delays per stage. For a power- and performance-optimized design, however, study [5] suggests an optimum of about 18 FO4 per stage (consisting of a logic delay of 15 FO4 and a latch insertion delay of 3 FO4). The first implementation of the CELL processor uses a design point of 11 FO4 per stage, since both performance and power are important. The latency of the FPU is six cycles for single-precision instructions in order to meet the performance requirements of the target workloads. Since result forwarding takes 6 FO4, the whole delay budget, including latch insertion delays, is 60 FO4 for the single-precision logic (six stages of 11 FO4 give 66 FO4, less the 6 FO4 for forwarding). A state-of-the-art FPU has a latency of around 100 FO4 [6], since single-precision data are usually handled in the double-precision unit. Designing a dedicated six-cycle single-precision floating-point unit with an 11 FO4 cycle time required optimizations at all design levels: architecture, logic, circuits, layout, and placement.

B. Latch Insertion Delays

CMOS static gates are used to implement most of the logic; dynamic circuits are used in certain timing-critical areas. A latch insertion delay of 2 FO4 to 3 FO4 occupies a significant portion of the 11 FO4 cycle. In order to minimize this delay, a special latch selection was provided (Table I). There are three types of latches: type-C, type-D, and type-E. All latches have multip