Microarchitecture and implementation of the synergistic processor in 65-nm and 90-nm SOI

B. Flachs, S. Asano, S. H. Dhong, H. P. Hofstee, G. Gervais, R. Kim, T. Le, P. Liu, J. Leenstra, J. S. Liberty, B. Michael, H.-J. Oh, S. M. Mueller, O. Takahashi, K. Hirairi, A. Kawasumi, H. Murakami, H. Noro, S. Onishi, J. Pille, J. Silberman, S. Yong, A. Hatakeyama, Y. Watanabe, N. Yano, D. A. Brokenshire, M. Peyravian, V. To, E. Iwata

This paper describes the architecture and implementation of the original gaming-oriented synergistic processor element (SPE) in both 90-nm and 65-nm silicon-on-insulator (SOI) technology and introduces a new SPE implementation targeted for the high-performance computing community. The Cell Broadband Engine processor contains eight SPEs. The dual-issue, four-way single-instruction multiple-data processor is designed to achieve high performance per area and power and is optimized to process streaming data, simulate physical phenomena, and render objects digitally. Most aspects of data movement and instruction flow are controlled by software to improve the performance of the memory system and the core performance density. The SPE was designed as an 11-FO4 (fan-out-of-4-inverter-delay) processor using 20.9 million transistors within 14.8 mm² using the IBM 90-nm SOI low-k process. CMOS (complementary metal-oxide semiconductor) static gates implement the majority of the logic. Dynamic circuits are used in critical areas and occupy 19% of the non-static random access memory (SRAM) area. Instruction set architecture, microarchitecture, and physical implementation are tightly coupled to achieve a compact and power-efficient design. Correct operation has been observed at up to 5.6 GHz and 7.3 GHz, respectively, in 90-nm and 65-nm SOI technology.

Introduction

As gaming develops into an immersive experience with more realistic rendering, object movement, and strategy, it is becoming increasingly similar to high-performance computing (HPC).
To achieve high levels of realism, gaming now uses traditional HPC algorithms and algorithms with very similar characteristics. These algorithms often process massive amounts of data in a manner that can be partitioned to enable parallel execution. For HPC and gaming, throughput is often more important than the performance of a single general-purpose thread with complex branching schemes. Gaming and HPC are also similar in that they are limited by factors such as component cost and power dissipation. More often than not, the question for HPC and gaming is not the speed at which a single thread can run, but the sustainable throughput per unit of system cost.

Two algorithmic methods of partitioning are popular. Vectorization has long been a staple of simulation and solution within the HPC community, while today's media-rich application software is often characterized by multiple lightweight threads and software pipelines. The trend in gaming is to merge these two characteristics. This trend in software design favors processors that combine characteristics of vector processing and multiple threads. Software running on these processors can efficiently drive the improved memory bandwidth becoming available from commodity memory systems. By contrast, software that runs on processors designed to accelerate a single thread of execution by exploiting instruction-level parallelism is much less able to benefit from the new memory systems.

Note: Portions of this paper are based on earlier publications by the authors. ©2006 IEEE. Reprinted, with permission, from References [1] and [2].

©Copyright 2007 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page.
The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.

IBM J. RES. & DEV. VOL. 51 NO. 5 SEPTEMBER 2007 B. FLACHS ET AL. 0018-8646/07/$5.00 © 2007 IBM