IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 33, NO. 11, NOVEMBER 1998

A Process-Independent, 800-MB/s, DRAM Byte-Wide Interface Featuring Command Interleaving and Concurrent Memory Operation

Matthew M. Griffin, Member, IEEE, Jared Zerbe, Member, IEEE, Grace Tsang, Michael Ching, Member, IEEE, and Clemenz L. Portmann, Member, IEEE

Abstract—An 800-Mb/s/pin byte-wide interface DRAM is described that meets the bandwidth requirements of modern microprocessor systems. The clock-recovery and I/O circuitry perform to specification across multiple DRAM manufacturers' processes. The clock-recovery circuitry is described in depth for the areas that are sensitive to power-supply noise. I/O circuitry that preserves signal integrity in high-speed bussed systems is described. A design methodology that enables rapid simulation and verification of the design in each fabrication process is discussed. Logic that enables interleaved transactions with concurrent operation is detailed. Computer-aided-design tools for large-aspect merged logic/memory designs are discussed. Last, measured results are summarized, showing clock jitter, setup and hold timing, and period versus operation.

Index Terms—CAD, data communication, delay-locked loop, DRAM, phase-locked loop.

I. INTRODUCTION

MODERN microprocessors require very-high-bandwidth communication with the memory subsystem. Designers have historically attempted to meet this requirement by increasing the memory bus width. Bus frequencies can often be increased only incrementally because of the challenges of reliable high-speed signaling. Memory bus widths beyond 64 bits create further challenges, such as ground bounce for controller I/O, the large number of pins required for the data path and power supply, printed-circuit-board layout complexity, and the need to incorporate memory expandability. This paper presents another method of meeting the challenge of high-bandwidth communication.
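The bandwidth trade-off above can be checked with simple arithmetic: a conventional 64-bit bus clocked at 100 Mb/s/pin and a byte-wide interface running at 800 Mb/s/pin deliver the same aggregate 800 MB/s, but the narrow interface uses one-eighth the data pins. A minimal sketch (the 100 Mb/s/pin figure for the wide bus is an illustrative assumption, not a number from the paper):

```python
def bandwidth_mb_per_s(pins, mbit_per_s_per_pin):
    """Aggregate bus bandwidth in MB/s from pin count and per-pin data rate."""
    return pins * mbit_per_s_per_pin / 8  # divide by 8 bits per byte

wide = bandwidth_mb_per_s(64, 100)   # 64-bit bus, 100 Mb/s/pin -> 800 MB/s
narrow = bandwidth_mb_per_s(8, 800)  # byte-wide bus, 800 Mb/s/pin -> 800 MB/s
```

Matching aggregate bandwidth with far fewer pins is what motivates raising the per-pin signaling rate instead of the bus width.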
A third-generation Rambus DRAM device attains greater than 800-MB/s performance while passing an extensive test suite of 260 000 vectors, using a design ported across multiple DRAM vendors (Fig. 1). This design satisfies the performance needs by overcoming the high-speed signaling challenge and by building upon a top-down memory architecture with improved circuitry.

Manuscript received April 3, 1998; revised June 8, 1998. The authors are with Rambus, Inc., Mountain View, CA 94040 USA (e-mail: matt@rambus.com). Publisher Item Identifier S 0018-9200(98)07047-4.

II. MEMORY ORGANIZATION AND INTERFACE

A memory core with four independent banks allows row-address-strobe (RAS)/column-address-strobe (CAS) operations on any two of the banks simultaneously. The memory core is similar to an EDO/SDRAM memory core; no special circuitry, including redundancy, is needed. The major differences in the memory-core architecture are the division of the row and column decoders for a four-bank/1024-row/256-column organization and nonhierarchical I/O buses for 64/72-b parallel data transport. Word-line speed is improved by distributing the row decoders among the four banks. Column-operation speed is improved by the nonhierarchical I/O and reduced I/O loading, since outputs are distributed evenly across the Rambus interface width instead of being clustered on the top side (Fig. 2).

Another improvement is a memory interface that allows concurrent operation and the ability to interleave command/data packets, with bandwidth approaching 95% of the peak. Concurrent operation permits simultaneous, independent RAS and CAS activity to any two of the memory banks. A flexible mechanism allows delayed starting of data transfers, whose lengths are unconstrained, to optimize bus-utilization efficiency. Four pieces of information are encoded and distinguishable on the control signals: command start, command op-code, data-transmission start, and data-transmission stop.
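The near-95% utilization claim can be illustrated with a toy pipeline model: each transfer needs some core-access cycles (RAS/CAS latency) before its data occupies the bus, and interleaving across banks hides the next transfer's core access under the current data packet. The cycle counts below are illustrative assumptions, not the device's actual timing:

```python
def data_bus_utilization(n, pre, data, interleaved):
    """Fraction of bus cycles carrying data for n back-to-back transfers.

    pre  -- core-access cycles before a transfer's data is ready (assumed)
    data -- cycles each data packet occupies the bus (assumed)
    """
    if interleaved:
        # Core access of transfer k+1 overlaps the data packet of transfer k,
        # so only the first access is exposed on the bus timeline.
        total = pre + n * data
    else:
        # Serial operation: every transfer waits out its own core access.
        total = n * (pre + data)
    return n * data / total

serial = data_bus_utilization(10, 4, 8, False)      # ~0.67
pipelined = data_bus_utilization(10, 4, 8, True)    # ~0.95
```

With these assumed numbers, interleaving lifts utilization from roughly two-thirds to about 95%, consistent with the packing of command/data packets end to end described above.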
The fine control over data-transmission startup allows very efficient memory-core CAS activity for data transfers of any length. Real-time encoding distinguishes between attention and data-transfer requests, thereby balancing high-speed interface pin usage. This encoding and data-flow control allows data and control packets to be packed end to end with very few bus-idle situations.

III. CLOCK-RECOVERY CIRCUITRY

Building the associated low-jitter clock-recovery circuit presents a significant challenge, especially when the design must be ported to multiple DRAM processes at different partner companies. Some DRAM processes use a negative substrate bias with a single power supply. Because the back-bias voltage is usually generated by a high-impedance charge pump, it can have several hundred millivolts of noise injected onto it during periods of heavy activity [Fig. 8(a)]. Other noise sources to consider are traditional power-supply/ground noise, high-frequency cycle-to-cycle jitter on the input clock, and duty-cycle jitter. Improvements in the clock-generation circuits allow them to reject each of these noise sources better than the previous design [2].

The delay-locked loop (DLL) architecture [3] has several blocks with integration of a bandgap-controlled nMOS current into a capacitance (Fig. 3). Building current/capacitance

0018–9200/98$10.00 © 1998 IEEE
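The closed-loop behavior of a DLL of the kind described above can be sketched at the behavioral level: a phase detector measures the misalignment between clock edges, and a charge pump integrates that error onto a control node, nudging the delay line until the edges align. This is only a first-order discrete-time sketch of the generic DLL principle, with an assumed loop gain; it is not a model of the circuit in [3]:

```python
def dll_lock(initial_delay, target, gain=0.25, steps=50):
    """First-order DLL behavioral model (assumed parameters).

    Each update, the phase-detector error is scaled by the loop gain
    and accumulated onto the delay-line control value, so the residual
    error shrinks geometrically by (1 - gain) per step.
    """
    d = initial_delay
    for _ in range(steps):
        error = target - d   # idealized phase-detector output
        d += gain * error    # charge pump integrates error onto control node
    return d

locked = dll_lock(0.0, 1.25)  # converges to ~1.25 after 50 updates
```

In the actual circuit, the "integration" is the bandgap-controlled current charging a capacitance; the code only mimics the resulting first-order settling behavior.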