IEEE SOLID-STATE CIRCUITS LETTERS, VOL. 1, NO. 12, DECEMBER 2018 225

A 7-nm 6R6W Register File With Double-Pumped Read and Write Operations for High-Bandwidth Memory in Machine Learning and CPU Processors

Hoan Nguyen, Jihoon Jeong, Francois Atallah, Daniel Yingling, and Keith Bowman

Abstract—A 7-nm register file (RF) with a 16-transistor (16T) 3-read and 3-write (3R3W) bitcell double pumps, or time multiplexes, the read and write access ports twice per clock cycle to achieve 6-read and 6-write (6R6W) operations per cycle for high-bandwidth (BW) on-die memory in high-performance machine learning and CPU processors. From silicon test-chip measurements at 0.9 V, the double-pumped (DP) 6R6W RF trades off a 19% lower maximum clock frequency (F_MAX) for 2× the number of read and write operations per cycle, resulting in a 62% higher memory BW compared to a conventional single-access (SA) 3R3W RF.

Index Terms—Double pump, high-bandwidth (BW) memory, machine learning (ML) memory, register file (RF), time multiplex.

I. INTRODUCTION

Machine learning (ML) processors contain a large array (i.e., thousands) of multiply-accumulate (MAC) units executing in parallel to achieve high throughput. This enormous level of parallelism requires on-die memories with ultrahigh bandwidth (BW) to maintain a high utilization rate across the MAC units [1], [2]. In addition, modern high-performance out-of-order (OoO) CPUs employ a large physical register file (PRF) to allow for a wider microarchitectural instruction window that enhances instruction-level parallelism. Furthermore, the PRF in a high-performance CPU requires a significant number of read and write ports to effectively execute the many OoO in-flight instructions. For these reasons, a large PRF with many read and write ports is critical to achieving high instructions per cycle (IPC) in today's CPUs.
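The bandwidth trade-off quoted in the abstract can be checked with a short calculation. The minimal Python sketch below assumes that memory BW scales as the product of clock frequency and accesses per cycle at a fixed word width. That model, the function name `relative_bandwidth`, and the variable names are illustrative assumptions and do not come from the letter; only the 19% and 2× figures are taken from the abstract.

```python
def relative_bandwidth(fmax_ratio: float, ops_ratio: float) -> float:
    """Return BW relative to a baseline, assuming BW is proportional to
    (clock frequency) x (accesses per cycle) at a fixed word width.
    This proportionality is an assumed model, not a formula from the letter."""
    return fmax_ratio * ops_ratio


# Figures from the abstract: DP mode runs at 19% lower F_MAX than SA mode
# and performs 2x the read and write operations per cycle.
dp_fmax_ratio = 1.0 - 0.19
dp_ops_ratio = 2.0

# Fractional BW gain of the DP 6R6W RF over the SA 3R3W RF.
gain = relative_bandwidth(dp_fmax_ratio, dp_ops_ratio) - 1.0
print(f"BW gain of DP 6R6W over SA 3R3W: {gain:.0%}")
```

Under this assumed model, 0.81 × 2 = 1.62, which is consistent with the 62% higher memory BW reported from the measurements.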
Traditional approaches for improving memory BW to allow more reads and writes per clock cycle include increasing the number of bitcell read and write ports or duplicating the bitcell. These approaches significantly increase the memory area, power, and latency [3]. Although hierarchical and banked register file (RF) designs [4] aim to achieve high CPU performance with a relatively small PRF and a low number of ports, these techniques introduce microarchitectural complexity, IPC degradation, and power and area overheads. Clustered microarchitectures [5] allow a lower number of RF ports, where each cluster of functional units contains a separate RF. This approach, however, incurs significant performance degradation due to intercluster communication. A previous RF [3] time multiplexes, or double pumps, the bitcell write ports to effectively double the number of RF write ports. Although this technique doubles the RF write operations per clock cycle, the conventional single-access (SA) read operation either limits the RF BW or requires a bitcell duplication technique to double the read operations per clock cycle at a cost of ∼2× in area and power [3]. This letter describes an RF with a 16-transistor (16T) 3-read and 3-write (3R3W) bitcell (Fig. 1) in a 7-nm [6] test chip that double pumps the read and write operations to achieve 6 reads and 6 writes (6R6W) per clock cycle, thus enabling high-BW on-die memory for high-performance ML and CPU processors while avoiding the excessive increases in area, power, and latency from either replicating the bitcells or adding more bitcell read and write ports [7].

Manuscript received December 2, 2018; revised January 25, 2019; accepted February 11, 2019. Date of publication April 18, 2019; date of current version May 17, 2019. This paper was approved by Associate Editor Alvin Leng Sun Loke. (Corresponding author: Hoan Nguyen.)
The authors are with Qualcomm Technologies, Inc., Raleigh, NC 27617 USA (e-mail: hoann@qti.qualcomm.com).
Digital Object Identifier 10.1109/LSSC.2019.2911885
2573-9603 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Fig. 1. 3R3W bitcell schematic and layout.

Fig. 2. Test-chip die micrograph and characteristics.

II. DESIGN AND IMPLEMENTATION

Implemented in a 7-nm FinFET CMOS technology [6], the test chip in Fig. 2 features the RF organized into four banks of 32 words with 32 bits per wordline (WL) and 16 bits per local read bit line (LRBL). The four banks share the global read bit line (GRBL). The RF configuration allows operation in either SA or double-pumped (DP) mode for each physical read and/or write port to enable a range of RF port counts from 3R3W to 6R6W. The RF interfaces with an on-die built-in self-test (BIST) unit to verify functionality. The total RF area is 8052 μm².

In Fig. 3, the RF read circuit consists of decode logic, LRBL precharge and evaluation, and GRBL. The key insight behind the DP read is to duplicate the decode logic and GRBL, enabling two separate decode and GRBL operations while only sharing the LRBL