2001 IEEE International Solid-State Circuits Conference, 0-7803-6608-5, ©2001 IEEE
ISSCC 2001 / SESSION 9 / INTEGRATED MULTIMEDIA PROCESSORS / 9.6

9.6 A 150MHz Graphics Rendering Processor with 256Mb Embedded DRAM

Aurangzeb K. Khan+, Hidetaka Magoshi*, Tadashi Matsumoto#, Jun-ichi Fujita**, Makoto Furuhashi*, Masatoshi Imai**, Yoshikazu Kurose#, Morio Sato#, Katsuhiko Sato#, Yujiro Yamashita#, Kinying Kwan+, Duc-Ngoc Le+, John H. Yu+, Trung Nguyen+, Steven Yang+, Allen Tsou+, King Chow+, John Shen+, Min Li+, Jun Li+, Hong Zhao+, Kenji Yoshida+

+Altius Solutions, Inc., Santa Clara, CA; *Sony Computer Entertainment Inc., Tokyo, Japan; #Sony Corp., Semiconductor Network Co., Tokyo, Japan; **Sony Kihara Research Center Inc., Tokyo, Japan

A 150MHz graphics rendering processor with 256Mb embedded DRAM (eDRAM) delivers a rendering rate of 75M polygons/s. 287.5M transistors (7.5M logic, 280M in the 256Mb eDRAM) are integrated on one 21.3x21.7mm² die in 0.18μm 6-metal CMOS. eDRAM bandwidth is 48GB/s (Table 9.6.1). Architectural and electrical design enhancements and advanced process technology enable an 8-fold increase in eDRAM integration (vs. the current PlayStation2 graphics rendering processor) to achieve 1920x1080 60frame/s progressive display resolution, which is beyond the digital HDTV standard (720 progressive and 1080 interlaced).

The graphics rendering processor for a real-time high-resolution computer graphics development system is based on an enhanced version of the PlayStation2 computer entertainment system architecture. The prototype system includes 16 sets of graphics units, each a combination of a 128b microprocessor [1] and this graphics rendering processor.

Electrical design considerations for large-area ultra-deep sub-micron (UDSM) ICs include power distribution, clocking architecture, chip signal integrity, and timing convergence. Cross-chip line width variation effects need to be addressed [2].
Careful electrical design of large-scale RC networks, together with design technology that characterizes interconnect-dominated timing and signal integrity, is critical to address these requirements [3].

A hierarchical, block-based design methodology accelerates the design schedule by enabling concurrent design of VLSI-scale sub-blocks in parallel with top-level design (Figure 9.6.1) [4]. Accurate “black-box” models reflecting boundary I/O loading and drive strength characteristics are critical to chip timing and signal integrity verification.

Conventional block models estimate boundary conditions with a linear input slew rate and lumped-capacitance output loading. These models ignore driver nonlinear transconductance and the resistance shielding effect of signal dispersion in resistive signal traces, contributing to delay over-estimation and signal integrity inaccuracies (Figure 9.6.2). Linear single-slope driver slew-rate estimation also limits accuracy. Chip-level timing can be verified (to limited accuracy) only at the last stage, with a flattened design, to account for signal RC networks. This causes late-stage design iterations and is impractical for large designs. Potential timing defects are left latent.

A nonlinear numerical model, called an efficient current source model (ECSM), represents instantaneous driver charging/discharging currents. The model correlates with SPICE to within 2% (Figure 9.6.3). A unique ECSM is created for each input-to-output timing path, for each circuit in the design. The ECSM driver model is then applied to the net-specific RC network to calculate signal slew rates at both driving and driven points. Resistance shielding on the RC network is calculated to within 5% of SPICE (Figure 9.6.2).

Asymptotic waveform estimation (AWE) is used to match higher-order moments for the driving-point admittance and impedance matrices, Y(s) and Z(s), and the transfer function G(s), to each driven point.
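The resistance shielding effect described above can be illustrated with a low-order reduction. The following is a minimal sketch, not the authors' ECSM/AWE flow: it computes the first three moments of the driving-point admittance Y(s) of an RC ladder and fits the widely used three-moment pi model (near capacitance, series resistance, far capacitance). The function names, the ladder topology (each segment a series R followed by a shunt C), and the example wire values are assumptions chosen for illustration.

```python
def admittance_moments(segments):
    """First three moments (y1, y2, y3) of the driving-point admittance
    Y(s) = y1*s + y2*s^2 + y3*s^3 + ... of an RC ladder.

    segments: list of (R_ohm, C_farad), ordered from driver to far end.
    Each segment is modelled as a series R followed by a shunt C (assumption).
    """
    y1 = y2 = y3 = 0.0
    for r, c in reversed(segments):
        y1 += c  # shunt capacitor contributes s*C
        # Series resistor: Y' = Y / (1 + R*Y), expanded to third order in s.
        y1, y2, y3 = (
            y1,
            y2 - r * y1 * y1,
            y3 - 2.0 * r * y1 * y2 + r * r * y1 ** 3,
        )
    return y1, y2, y3


def pi_model(y1, y2, y3):
    """Fit C_near -- R -- C_far so that its admittance matches y1, y2, y3."""
    c_far = y2 * y2 / y3
    r = -(y3 * y3) / (y2 ** 3)
    c_near = y1 - c_far
    return c_near, r, c_far


# Hypothetical wire: 10 segments of 50 ohm and 20 fF each (illustrative values).
wire = [(50.0, 20e-15)] * 10
total_c = sum(c for _, c in wire)
c_near, r_pi, c_far = pi_model(*admittance_moments(wire))
```

Because `c_near` is smaller than `total_c`, a driver model loaded with the pi network initially charges less capacitance than a lumped-capacitance model assumes; the remainder sits behind `r_pi`. This is the shielding that a lumped load misses and that leads to delay over-estimation.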
Complex interconnect RC networks are reduced to a low-order moment matrix (order 2 to 7). Dominant time constants are then calculated so that an approximated transfer function and driving-point Y(s) and Z(s) can be provided for numerical-analysis-based delay calculation. These reduced-RC models are connected with the driving-point ECSM to calculate the signal waveform (at driving and driven points) and the interconnect delay. ECSM-based block models accurately reflect driver nonlinear behavior and RC-loading effects.

More than 400k components in eleven blocks, 31k to 218k gates each, are designed in parallel; “black-box” ECSM models are generated and hierarchically propagated to the top level. Top-level delay calculation maintains design hierarchy and hence provides exact per-block boundary conditions, against which block-level timing is verified. This fully hierarchical methodology enables block and top-level design to proceed in parallel, accelerating the schedule. Chip design from netlist to tapeout was completed in 10 weeks.

Functional architecture, signal integrity, power/clock distribution, and internal and I/O bus timing requirements drive the chip floorplan (Figure 9.6.1). Top- and block-level synthesis is driven from floorplan-based custom wire-load models; <2% of the >500k total signal nets required post-layout timing optimization.

The chip power/ground architecture supports >2A/2-sides sustained current. One hundred 30μm-wide M5 and M6 buses (per two sides) conduct power from the VDD and VSS pins to the blocks across a distance of ~5mm per side. Block power distribution is on M4. Worst-case IR drop is ~50mV.

The chip clock architecture delivers <250ps worst-case skew across the entire chip (under worst-case device, metal, and operating conditions) while driving 68k flip-flops, with a balanced clock tree at the top level and non-uniform ladder and mesh clock schemes within blocks. Inter-block clocks achieve <100ps skew. Insertion delay variation is reduced with multiple parallel buffers.
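To put the skew figures above in context, the short calculation below relates them to the clock period. It is a minimal sketch that uses only the 150MHz clock rate and the <250ps and <100ps skew bounds quoted in the text; the function names are hypothetical.

```python
def clock_period_ps(freq_hz):
    """Clock period in picoseconds for a given frequency in Hz."""
    return 1e12 / freq_hz


def skew_fraction(skew_ps, freq_hz):
    """Skew expressed as a fraction of one clock period."""
    return skew_ps / clock_period_ps(freq_hz)


F_CLK = 150e6                                  # 150MHz core clock (from the text)
period_ps = clock_period_ps(F_CLK)             # roughly 6667 ps
chip_skew_frac = skew_fraction(250.0, F_CLK)   # chip-wide worst-case bound
block_skew_frac = skew_fraction(100.0, F_CLK)  # inter-block bound
```

At 150MHz the chip-wide bound corresponds to under 4% of the cycle and the inter-block bound to under 2%.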
The buffer circuit achieves a balanced duty cycle (<3% worst case).

Signal slew rate, equivalent load capacitance, node-to-node and net-to-net coupling capacitance, and delay data are maintained in a fully hierarchical database. Loading capacitance at each driving node and signal slew rate at each receiving node are computed, so under-driven or high-load nets exceeding slew rate limits (e.g., 3ns) are identified automatically; a replace-or-insert methodology is then applied to modify the design netlist. This automation accelerates static-timing-analysis (STA) setup and hold time optimization. More than 10k STA violations are corrected in <2 weeks.

I/O simultaneous switching noise effects are addressed by co-optimizing I/O circuit design and the bus I/O, clock/strobe I/O, and VDD/VSS pin assignments. Current transients are isolated by placing cuts in the I/O power buses to confine high-frequency noise, with dedicated VDD/VSS pads per segment to isolate bus-to-bus noise. Edge-sensitive and asynchronous I/Os are isolated with dedicated VDD/VSS pads, as are core and I/O VDD and VSS.

The chip size makes on-chip signal integrity challenging. Net-to-net-coupling-induced delay variation is addressed (including resistance shielding effects) by analyzing all nets for crosstalk immunity (e.g., >20% of total capacitance coupled from one neighbor) and inserting buffers or moving neighbor segments to 2x pitch (Figure 9.6.4). More than 2k nets are >10mm long. About 4k repeaters are added to ensure robust signal integrity.

Performance data is summarized in Figure 9.6.5 (shmoo plot) and in Table 9.6.1. The IC micrograph is shown in Figure 9.6.6.

Acknowledgments:
The authors thank H. Takeuchi and S. Iwasaki of Sony Kihara Research Center Inc.; M. Kaihatsu, A. Tamura, A. Yamazaki, T. Horioka, A. Hakomori, T. Sekihara, M. Kitano, and K. Inoue of Sony Corp., Semiconductor Network Co.; and K. Fujita, H. Nagashima, H. Furuzono, and H. Truong of Altius Solutions, Inc. for contributions.

References:
[1] M. Suzuoki et al., “A Microprocessor with a 128-bit CPU, Ten Floating Point MAC’s, Four Floating-Point Dividers, and an MPEG-2 Decoder,” IEEE J. Solid-State Circuits, vol. 34, pp. 1608-1618, Nov. 1999.
[2] S. Nassif, “Delay Variability: Sources, Impacts and Trends,” ISSCC Digest of Technical Papers, pp. 369-69, Feb. 2000.
[3] S. Naffziger, “Design Methodologies for Interconnects in GHz+ ICs,” ISSCC Short Course, Feb. 1999.
[4] S. Nemazie et al., “260Mb/s Mixed-Signal Single-Chip Integrated System Electronics for Magnetic Hard Disk Drives,” ISSCC Digest of Technical Papers, pp. 42-43, 443, and Slide Supplement to the Digest of Technical Papers, pp. 44-45, Feb. 1999.