2001 IEEE International Solid-State Circuits Conference 0-7803-6608-5 ©2001 IEEE
ISSCC 2001 / SESSION 9 / INTEGRATED MULTIMEDIA PROCESSORS / 9.6
9.6 A 150MHz Graphics Rendering Processor with 256Mb
Embedded DRAM
Aurangzeb K. Khan+, Hidetaka Magoshi*, Tadashi Matsumoto#, Jun-ichi
Fujita**, Makoto Furuhashi*, Masatoshi Imai**, Yoshikazu Kurose#, Morio
Sato#, Katsuhiko Sato#, Yujiro Yamashita#, Kinying Kwan+, Duc-Ngoc Le+,
John H. Yu+, Trung Nguyen+, Steven Yang+, Allen Tsou+, King Chow+, John
Shen+, Min Li+, Jun Li+, Hong Zhao+, Kenji Yoshida+
+ Altius Solutions, Inc., Santa Clara, CA
* Sony Computer Entertainment Inc., Tokyo, Japan
# Sony Corp., Semiconductor Network Co., Tokyo, Japan
** Sony Kihara Research Center Inc., Tokyo, Japan
A 150MHz graphics rendering processor with 256Mb embedded
DRAM (eDRAM) delivers a rendering rate of 75M polygons/s.
287.5M transistors (7.5M logic, 280M in 256Mb eDRAM) are integrated
on one 21.3x21.7mm² die in 0.18μm 6-metal CMOS. eDRAM
bandwidth is 48GB/s (Table 9.6.1). Architectural and electrical design
enhancements and advanced process technology enable an 8-fold
increase in eDRAM integration (vs. current PlayStation2 graphics
rendering processor) to achieve 1920x1080 60frame/s progressive dis-
play resolution, which is beyond the digital HDTV standard (720 pro-
gressive and 1080 interlaced).
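The headline figures above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only. It assumes decimal units (1GB = 10^9 bytes) and that the eDRAM interface runs at the 150MHz core clock; neither is stated in the text, and all variable names are hypothetical.

```python
# Back-of-envelope checks on the figures quoted in the text.
# Assumptions (not from the source): GB means 10**9 bytes, and the
# eDRAM interface is clocked at the 150MHz core clock.

CLOCK_HZ = 150_000_000
EDRAM_BANDWIDTH_BYTES_PER_S = 48 * 10**9
POLYGONS_PER_S = 75_000_000

# Transistor budget in millions: logic plus eDRAM.
total_transistors_m = 7.5 + 280.0

# Die area in mm^2 from the quoted edge lengths.
die_area_mm2 = 21.3 * 21.7

# Data moved per clock if the quoted bandwidth is sustained every cycle.
bytes_per_cycle = EDRAM_BANDWIDTH_BYTES_PER_S // CLOCK_HZ
bits_per_cycle = bytes_per_cycle * 8

# Clock cycles available per polygon at the peak rendering rate.
cycles_per_polygon = CLOCK_HZ / POLYGONS_PER_S

# Pixel rate of a 1920x1080 progressive display at 60 frames/s.
pixels_per_s = 1920 * 1080 * 60

print(f"transistors:        {total_transistors_m}M")
print(f"die area:           {die_area_mm2:.2f} mm^2")
print(f"eDRAM bits/cycle:   {bits_per_cycle}")
print(f"cycles per polygon: {cycles_per_polygon}")
print(f"display pixels/s:   {pixels_per_s}")
```

Under these assumptions the quoted bandwidth corresponds to a very wide on-chip data path, which is the usual motivation for embedding the DRAM on the same die as the rendering logic.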
The graphics rendering processor for a real-time high-resolution
computer graphics development system is based on an enhanced ver-
sion of the PlayStation2 computer entertainment system architec-
ture. The prototype system includes 16 sets of graphics units, which
are a combination of a 128b microprocessor [1] and this graphics ren-
dering processor.
Electrical design considerations for large-area ultra-deep sub-micron
(UDSM) ICs include: power distribution, clocking architecture, chip
signal integrity, and timing convergence. Cross-chip line width vari-
ation effects need to be addressed [2]. Careful electrical design of
large-scale RC networks and design technology to characterize inter-
connect-dominated timing and signal integrity are critical to address
these requirements [3].
A hierarchical, block-based design methodology accelerates the design
schedule by enabling concurrent design of VLSI-scale sub-blocks in
parallel with top-level design (Figure 9.6.1) [4]. Accurate “black-box”
models reflecting boundary I/O loading and drive strength charac-
teristics are critical to chip timing and signal integrity verification.
Conventional block models estimate boundary conditions with linear
input slew rate and lumped-capacitance output loading. These mod-
els ignore driver nonlinear transconductance and the resistance
shielding effect of signal dispersion in resistive signal traces, con-
tributing to delay over-estimation and signal integrity inaccuracies
(Figure 9.6.2). Linear single-slope driver slew-rate estimation also
limits accuracy. Chip-level timing can be verified (to limited accura-
cy) only at the last stage with a flattened design to account for signal
RC networks. This causes late-stage design iterations and is imprac-
tical for large designs. Potential timing defects are left latent.
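The over-estimation caused by lumped-capacitance loading can be illustrated with a small numerical experiment. The sketch below compares the 50% delay at the driver output for a single lumped capacitor against a C-R-C (pi) load in which the wire resistance shields the far-end capacitance. It deliberately models the driver as a linear resistor, which is itself one of the simplifications criticized above; the component values and function names are hypothetical and are not taken from the design.

```python
import math


def lumped_t50(r_drv, c_total):
    """50% delay of a linear driver charging one lumped capacitor."""
    return math.log(2.0) * r_drv * c_total


def pi_load_t50(r_drv, c_near, r_wire, c_far, dt=1e-13, t_max=5e-9):
    """50% delay at the driver output for a C-R-C (pi) load.

    Forward-Euler integration of the two node voltages for a unit-step
    supply (Vdd normalised to 1.0). Returns None if the 50% point is
    not reached before t_max.
    """
    v_near = 0.0
    v_far = 0.0
    t = 0.0
    while t < t_max:
        i_drv = (1.0 - v_near) / r_drv      # current from the driver
        i_wire = (v_near - v_far) / r_wire  # current into the far end
        v_near += dt * (i_drv - i_wire) / c_near
        v_far += dt * i_wire / c_far
        t += dt
        if v_near >= 0.5:
            return t
    return None


# Hypothetical values: 1 kohm driver, 2 kohm wire, 50 fF near, 200 fF far.
R_DRV, R_WIRE = 1.0e3, 2.0e3
C_NEAR, C_FAR = 50e-15, 200e-15

t_lumped = lumped_t50(R_DRV, C_NEAR + C_FAR)
t_shielded = pi_load_t50(R_DRV, C_NEAR, R_WIRE, C_FAR)

print(f"lumped-C driver delay: {t_lumped * 1e12:.1f} ps")
print(f"pi-load driver delay:  {t_shielded * 1e12:.1f} ps")
```

With these values the lumped model reports a noticeably longer driver-output delay than the pi load, because the driver initially sees little more than the near-end capacitance.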
A nonlinear numerical model, called an efficient current source
model (ECSM), represents instantaneous driver charging/discharging
currents. The model correlates with SPICE to within 2%
(Figure 9.6.3). A unique ECSM is created for each input-to-output
timing path, for each circuit in the design. The ECSM driver model
is then applied to the net-specific RC network to calculate signal slew
rates at both driving and driven points. Resistance shielding on the
RC network is calculated to within 5% of SPICE (Figure
9.6.2). Asymptotic waveform estimation (AWE) is used to match higher-order
moments for the driving-point admittance and impedance
matrices, Y(s) and Z(s), and the transfer function G(s), to each driven
point. Complex interconnect RC networks are reduced to a lower-order
moment matrix (order 2 to 7). Dominant time constants are then calculated
so that an approximated transfer function and driving point
Y(s) and Z(s) can be provided for numerical-analysis-based delay cal-
culation. These reduced-RC models are connected with driving point
ECSM to calculate the signal waveform (at driving and driven points)
and interconnect delay. ECSM-based block models accurately reflect
driver nonlinear behavior and RC-loading effects.
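The moment-matching step can be sketched for the simplest case, an RC ladder driven by an ideal step source. The code below computes the voltage-transfer moments at each node, fits a two-pole denominator to the first four moments of the far-end node, and extracts the corresponding time constants. It is a minimal illustration in the spirit of AWE, not the authors' implementation: it handles only a ladder (not a general RC network), omits the driving-point Y(s)/Z(s) reduction, assumes the fitted poles are real, and uses hypothetical function names and normalised element values.

```python
import math


def ladder_moments(res, cap, order):
    """Voltage-transfer moments m_0..m_order at every node of an RC ladder.

    res[i] is the series resistance into node i and cap[i] the grounded
    capacitance at node i. With H_i(s) = sum_k m_k[i] * s**k:
        m_0[i] = 1
        m_k[i] = -sum_j R_shared(i, j) * cap[j] * m_{k-1}[j]
    where R_shared(i, j) is the resistance common to the source-to-i and
    source-to-j paths (for a ladder, the sum of res up to min(i, j)).
    """
    n = len(res)
    upstream = []
    total = 0.0
    for r in res:
        total += r
        upstream.append(total)
    moments = [[1.0] * n]
    for _ in range(order):
        prev = moments[-1]
        moments.append([
            -sum(upstream[min(i, j)] * cap[j] * prev[j] for j in range(n))
            for i in range(n)
        ])
    return moments


def two_pole_denominator(m):
    """Fit 1 + b1*s + b2*s**2 to the moments m[0..3] of one node."""
    # The s^2 and s^3 coefficients of (1 + b1*s + b2*s^2) * H(s) must vanish:
    #   m2 + b1*m1 + b2*m0 = 0
    #   m3 + b1*m2 + b2*m1 = 0
    det = m[1] * m[1] - m[0] * m[2]
    b1 = (m[0] * m[3] - m[1] * m[2]) / det
    b2 = (m[2] * m[2] - m[1] * m[3]) / det
    return b1, b2


def time_constants(b1, b2):
    """Time constants -1/p for the roots p of b2*s^2 + b1*s + 1 = 0.

    Assumes real poles; math.sqrt raises ValueError otherwise.
    """
    disc = math.sqrt(b1 * b1 - 4.0 * b2)
    poles = ((-b1 + disc) / (2.0 * b2), (-b1 - disc) / (2.0 * b2))
    return tuple(-1.0 / p for p in poles)


# Two-section ladder in normalised units (R = 1, C = 1 per section).
moments = ladder_moments([1.0, 1.0], [1.0, 1.0], order=3)
far_end = [m_k[-1] for m_k in moments]
b1, b2 = two_pole_denominator(far_end)
taus = time_constants(b1, b2)

print("far-end moments:       ", far_end)
print("Elmore delay (-m1):    ", -far_end[1])
print("dominant time constant:", max(taus))
```

For this two-section ladder the two-pole fit recovers the exact denominator, since the circuit has only two poles. For larger networks the same procedure yields a reduced-order approximation whose dominant time constant governs the delay.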
>400k components in eleven blocks, 31k to 218k gates each, are
designed in parallel; “black-box” ECSM models are generated and
hierarchically propagated to top level. Top-level delay calculation
maintains design hierarchy and hence provides exact boundary con-
ditions per-block, to which block-level timing is verified. This fully-
hierarchical methodology enables block and top-level design to proceed
in parallel, accelerating the schedule. Chip design from netlist to
tapeout was completed in 10 weeks.
Functional architecture, signal integrity, power/clock distribution,
and internal and I/O bus timing requirements drive the chip floorplan
(Figure 9.6.1). Top- and block-level synthesis is driven from floorplan-
based custom wire-load models; <2% of the >500k total signal nets
required post-layout timing optimization.
The chip power/ground architecture supports >2A sustained current
per two sides. One hundred 30μm-wide M5 and M6 buses (per two sides)
conduct power from the VDD and VSS pins to the blocks across a distance
of ~5mm per side. Block power distribution is on M4. Worst-case IR drop
is ~50mV.
The chip clock architecture delivers <250ps worst-case skew across
the entire chip (under worst-case device, metal and operating condi-
tions) while driving 68k flip-flops, with a balanced clock tree at top-
level and non-uniform ladder & mesh clock schemes within blocks.
Inter-block clocks achieve <100ps skew. Insertion delay variation is
reduced with multiple parallel buffers. The buffer circuit achieves a
balanced duty cycle (<3% worst case).
Signal slew rate, equivalent load capacitance, node-to-node and net-
to-net coupling capacitance, and delay data are maintained in a fully-
hierarchical database. Loading capacitance at each driving node and
signal slew rate at each receiving node are computed, so under-driven
or high-load nets exceeding slew rate limits (e.g., 3ns) are automatically
identified; a replace-or-insert methodology is then applied to
modify the design netlist. This automation accelerates static-timing-
analysis setup and hold time optimization. >10k STA violations are
corrected in <2 weeks.
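The slew-limit screen described above amounts to a filter over per-net data followed by a choice of fix. A minimal sketch follows, assuming a hypothetical `NetReport` record and a hypothetical capacitance threshold for choosing between replacing the driver and inserting a buffer. Only the 3ns example limit comes from the text; the actual selection criteria are not given.

```python
from dataclasses import dataclass

SLEW_LIMIT_NS = 3.0  # example limit quoted in the text


@dataclass
class NetReport:
    """Per-net data as it might be read from the hierarchical database."""
    name: str
    load_cap_pf: float    # loading capacitance at the driving node
    worst_slew_ns: float  # slowest slew seen at any receiving node


def slew_violations(nets, limit_ns=SLEW_LIMIT_NS):
    """Return the nets whose worst receiver slew exceeds the limit."""
    return [net for net in nets if net.worst_slew_ns > limit_ns]


def propose_fix(net, max_single_stage_cap_pf=2.0):
    """Hypothetical replace-or-insert rule.

    Moderate loads are assumed fixable by swapping in a stronger driver;
    heavier loads are assumed to need an inserted buffer. The threshold
    is an illustrative value, not one from the source.
    """
    if net.load_cap_pf <= max_single_stage_cap_pf:
        return "replace driver"
    return "insert buffer"


# Hypothetical example data.
nets = [
    NetReport("ctrl_a", load_cap_pf=0.4, worst_slew_ns=1.2),
    NetReport("bus_b", load_cap_pf=1.5, worst_slew_ns=3.4),
    NetReport("glob_c", load_cap_pf=4.8, worst_slew_ns=6.1),
]

for net in slew_violations(nets):
    print(f"{net.name}: slew {net.worst_slew_ns}ns -> {propose_fix(net)}")
```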
I/O simultaneous switching noise effects are addressed by co-opti-
mizing I/O circuit design and bus I/O, clock/strobe I/O and VDD/VSS
pin assignments. Current transients are isolated by placing cuts in
I/O power buses to confine high-frequency noise, with dedicated
VDD/VSS pads per segment to isolate bus-to-bus noise. Edge-sensi-
tive and asynchronous I/Os are isolated with dedicated VDD/VSS
pads, as are core and I/O VDD and VSS.
The chip size makes on-chip signal integrity challenging. Net-to-net-
coupling-induced delay variation is addressed (including resistance
shielding effects) by analyzing all nets for crosstalk immunity (e.g.,
>20%-of-total-capacitance coupled from one neighbor) and inserting
buffers or moving neighbor segments to 2x pitch (Figure 9.6.4). >2k
nets are >10mm long. ~4k repeaters are added to ensure robust signal
integrity.
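The crosstalk-immunity criterion quoted above (more than 20% of a net's total capacitance coupled from a single neighbor) can be expressed as a simple ratio test. The sketch below assumes extracted capacitances are available per net as a ground capacitance plus a mapping from neighbor name to coupling capacitance. The function names and example values are hypothetical, and the sketch covers only the screening step, not the buffer insertion or respacing that follows.

```python
def worst_coupling_fraction(ground_cap_ff, coupling_caps_ff):
    """Largest single-neighbor coupling as a fraction of total capacitance."""
    total = ground_cap_ff + sum(coupling_caps_ff.values())
    if not coupling_caps_ff or total <= 0.0:
        return 0.0
    return max(coupling_caps_ff.values()) / total


def needs_crosstalk_fix(ground_cap_ff, coupling_caps_ff, threshold=0.20):
    """True if one neighbor contributes more than `threshold` of the total."""
    return worst_coupling_fraction(ground_cap_ff, coupling_caps_ff) > threshold


# Hypothetical nets: capacitance to ground plus coupling per neighbor (fF).
victim = (60.0, {"nbr_left": 30.0, "nbr_right": 10.0})
quiet = (90.0, {"nbr_left": 5.0, "nbr_right": 5.0})

print("victim flagged:", needs_crosstalk_fix(*victim))
print("quiet flagged: ", needs_crosstalk_fix(*quiet))
```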
Performance data is summarized in Figure 9.6.5 (shmoo plot) and in
Table 9.6.1. The IC micrograph is shown in Figure 9.6.6.
Acknowledgments:
The authors thank H. Takeuchi and S. Iwasaki of Sony Kihara Research
Center Inc., M. Kaihatsu, A. Tamura, A. Yamazaki, T. Horioka, A. Hakomori,
T. Sekihara, M. Kitano, and K. Inoue of Sony Corp., Semiconductor Network
Co., and K. Fujita, H. Nagashima, H. Furuzono and H. Truong of Altius
Solutions, Inc. for contributions.
References:
[1] M. Suzuoki et al., “A Microprocessor with a 128-bit CPU, Ten Floating Point
MAC’s, Four Floating-Point Dividers, and an MPEG-2 Decoder,” IEEE J. Solid-
State Circuits, vol. 34, pp. 1608-1618, Nov. 1999.
[2] S. Nassif, “Delay Variability: Sources, Impacts and Trends,” ISSCC Digest
of Technical Papers, pp. 369-69, Feb. 2000.
[3] S. Naffziger, “Design Methodologies for Interconnects in GHz+ ICs,” ISSCC
Short Course, Feb. 1999.
[4] S. Nemazie et al., “260Mb/s Mixed-Signal Single-Chip Integrated System
Electronics for Magnetic Hard Disk Drives,” ISSCC Digest of Technical Papers,
pp. 42-43, 443, and Slide Supplement to the Digest of Technical Papers, pp. 44-
45, Feb. 1999.