IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 30, NO. 11, NOVEMBER 1995 1203

A 300-MHz 64-b Quad-Issue CMOS RISC Microprocessor

Bradley J. Benschneider, Andrew J. Black, Member, IEEE, William J. Bowhill, Member, IEEE, Sharon M. Britton, Daniel E. Dever, Dale R. Donchin, Member, IEEE, Robert J. Dupcak, Richard M. Fromm, Mary K. Gowan, Paul E. Gronowski, Michael Kantrowitz, Member, IEEE, Marc E. Lamere, Shekhar Mehta, Jeanne E. Meyer, Robert O. Mueller, Andy Olesin, Ronald P. Preston, Member, IEEE, Donald A. Priore, Sribalan Santhanam, Michael J. Smith, and Gilbert M. Wolrich

Abstract- This 300 MHz quad-issue custom VLSI implementation of the Alpha architecture delivers 1200 MIPS (peak), 600 MFLOPS (peak), 341 SPECint92, and 512 SPECfp92. The 16.5 mm x 18.1 mm die contains 9.3 M transistors and dissipates 50 W at 300 MHz. It is fabricated in a 3.3 V, four-layer metal, 0.5 µm, CMOS process. The upper metal layers (metal-3 and metal-4) are primarily used for power, ground, and clock distribution. The chip supports 3.3 V/5.0 V interfaces and is packaged in a 499-pin ceramic IPGA. It contains an 8-kbyte instruction cache; an 8-kbyte, dual-ported, data cache; and a 96-kbyte, unified, second-level, 3-way set associative, fully pipelined, write-back cache. This paper describes the circuit and implementation techniques that were used to attain the 300 MHz operating frequency.

I. INTRODUCTION

A SECOND-GENERATION Alpha RISC microprocessor has been designed that operates at an internal clock frequency of 300 MHz. The 16.5 mm x 18.1 mm die contains 9.3 million transistors and delivers a peak performance of 1.2 billion instructions per second (BIPS) and 600 million floating point operations per second (MFLOPS). This chip has attained measured performance of 341 SPECint92 and 512 SPECfp92. The chip is implemented in a 3.3 V, 4-layer metal, 0.5 µm, CMOS process and is housed in a 499-pin interstitial pin grid array (IPGA) package.
Power dissipation is 50 W from a 3.3 V supply at 300 MHz. Fig. 1 shows a photomicrograph of the chip with an overlay showing all major sections.

The high performance of this second-generation implementation results from many factors, including: 0.5 µm CMOS process technology; 300 MHz internal clock frequency; grid based power and clock distribution; fast and versatile latching scheme; innovative circuit techniques; advanced design and verification tools.

In addition, several architectural improvements over the first Alpha implementation [1] are included in this design. The key architectural performance features are four-way superscalar instruction issue; a high-throughput, nonblocking memory subsystem with low latency primary caches; a large second-level on-chip write-back cache; and reduced operational latencies in all of the functional units.

Manuscript received May 4, 1995; revised August 24, 1995. The authors are with Digital Semiconductor, Hudson, MA 01749 USA. IEEE Log Number 9415232.

II. ARCHITECTURE

As shown in Fig. 2, the chip is functionally partitioned into the following major sections: the instruction unit (I-Box), the integer execution unit (E-Box), the floating point unit (F-Box), the memory management unit (M-Box), and the cache control and bus interface unit (C-Box). The chip features two levels of on-chip cache. The first level consists of an 8-kbyte instruction cache (I-Cache) and an 8-kbyte data cache (D-Cache). The second level is a 96-kbyte unified instruction and data cache.

The I-Box contains the 8-kbyte, direct-mapped I-Cache, an instruction prefetcher and associated refill buffer, branch prediction logic, and a 48-entry, fully associative instruction translation buffer. The I-Box can issue up to two integer and two floating point instructions per cycle. Instructions are issued in-order but may complete out-of-order.

The E-Box contains two execution pipelines and a register file for integer operands.
Both E-Box pipelines execute load, arithmetic, and logical instructions. In addition, one of the pipelines executes shift and store instructions, while the other pipeline completes jumps and branches. Multiply instructions are executed in a separate unit attached to one of the pipelines. Both pipelines implement full register bypassing, allowing the results from all function units to be available for immediate use. All integer instructions except multiply complete in one cycle.

The F-Box contains a register file for floating point operands and two execution pipelines. One pipeline executes multiply instructions while the other executes all remaining instructions. Divide instructions are executed in a separate unit attached to one of the pipelines. All floating point instructions except divide execute in four cycles, a two-cycle reduction from the previous implementation.

The M-Box contains the 8-kbyte, direct-mapped D-Cache, a fully associative, 64-entry data translation buffer (DTB), a miss address file for queuing and merging misses from the first-level caches, and a write buffer. The M-Box processes load, store, and memory barrier instructions.

0018-9200/95$04.00 © 1995 IEEE
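
The pipeline assignments described in Section II can be illustrated with a small model. The following Python sketch is not the chip's issue logic; it only encodes which instruction classes each of the four pipelines accepts, packs an in-order instruction stream into issue groups, and reproduces the peak-rate arithmetic (4 instructions x 300 MHz = 1200 MIPS, 2 floating point instructions x 300 MHz = 600 MFLOPS). The pipeline labels `E0`, `E1`, `FM`, `FA` and the instruction class names are hypothetical, chosen for this example. Data dependencies, operation latencies, fetch alignment, integer multiply, and floating point divide are all ignored, since the text does not say which pipeline hosts the separate multiply and divide units.

```python
from itertools import permutations

CLOCK_MHZ = 300  # internal clock frequency stated in the paper

# Instruction classes accepted by each pipeline, following Section II.
# Pipeline labels and class names are hypothetical, for illustration only.
PIPES = {
    "E0": {"load", "int_arith", "logical", "shift", "store"},
    "E1": {"load", "int_arith", "logical", "jump", "branch"},
    "FM": {"fp_mul"},            # floating point multiply pipeline
    "FA": {"fp_add", "fp_other"},  # all remaining floating point operations
}

# Peak rates follow directly from the pipeline count and the clock.
PEAK_MIPS = len(PIPES) * CLOCK_MHZ
PEAK_MFLOPS = sum(1 for p in PIPES if p.startswith("F")) * CLOCK_MHZ


def assign(ops):
    """Return a {pipe: op} mapping that places every op, or None if impossible."""
    for pipes in permutations(PIPES, len(ops)):
        if all(op in PIPES[p] for p, op in zip(pipes, ops)):
            return dict(zip(pipes, ops))
    return None


def issue_cycles(stream):
    """Pack an in-order stream into per-cycle issue groups.

    Each group is the longest prefix of the remaining stream that fits the
    pipelines; an instruction that does not fit blocks all later ones.
    """
    cycles, i = [], 0
    while i < len(stream):
        group, n = {}, 0
        while i + n < len(stream):
            trial = assign(stream[i:i + n + 1])
            if trial is None:
                break
            group, n = trial, n + 1
        if n == 0:
            raise ValueError(f"unknown instruction class: {stream[i]!r}")
        cycles.append(group)
        i += n
    return cycles
```

Under these assumptions, a stream such as `["load", "shift", "fp_mul", "fp_add"]` fits in a single issue group, whereas `["shift", "store"]` takes two groups because only one integer pipeline accepts shift and store instructions.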