Tileable Monolithic ReRAM Memory Design

Meenatchi Jagasivamani*, Candace Walden*, Devesh Singh*, Luyi Kang*, Mehdi Asnaashari†, Sylvain Dubois†, Bruce Jacob*, and Donald Yeung*

* Department of Electrical & Computer Engineering, University of Maryland, College Park, MD, 20742.
† Crossbar Incorporated, Santa Clara, CA, 95054.

Non-volatile memory, such as resistive RAM (ReRAM), is compatible with standard CMOS logic processes, allowing a sizable main memory system to be integrated into a CPU's die. ReRAM bitcells are fabricated within crosspoint sub-arrays that leave the bulk of the transistors underneath the sub-arrays vacant. This permits placing the memory system over other logic. We propose a tileable, centralized ReRAM design placed over a large last level cache. This design takes advantage of ReRAM's unique characteristics while still providing flexibility to designers.

Keywords: Crosspoint architectures, ReRAM, and on-die main memory systems.

Emerging non-volatile memories and 3D integration have recently received considerable attention as components of the memory hierarchy. This has led to the development of non-volatile memories that allow 3D stacking of the memory cells to improve density. Examples include Intel's 3D XPoint [1] and Crossbar's 3D ReRAM [2]. In these 3D memory architectures, called "crosspoint architectures", the memory bitcells are sandwiched between metal wires, and individual bitcells are isolated by per-cell "selector devices" rather than access transistors. The use of selector devices enables extremely small bitcells that can be stacked vertically across multiple metal layers. It also means that the transistors underneath these sub-arrays are free for implementing unrelated circuits. Some logic is still needed for access circuitry, but the bulk of the transistors are unused.

The benefits of these memory technologies include higher densities, and thus higher capacities, than DRAM; lower power, as refresh is no longer required; and CMOS compatibility.
Whereas DRAM requires special VLSI processes tuned for implementing DRAM's memory cells, 3D crosspoint sub-arrays can be fabricated today in commercial CMOS fabs at the same technology node as the underlying logic [2]. This implies that the crosspoint memory can be fabricated over the CPU, occupying the top-level metal layers of the CPU's die and creating a monolithically integrated CPU–main memory chip. Meanwhile, the CPU's logic can be implemented in the die's logic transistors, minus those needed for the memory access circuits.

Putting the CPU and its memory system on the same die will significantly reduce the energy needed to access memory, improving power efficiency. It will also allow for an extremely wide connection between the cores and the memory system, which will benefit highly parallel architectures such as tiled CPUs. This will provide much higher throughput and performance for data-intensive computations.

The drawback, though, is that these memories have higher latency than DRAM and suffer from wearout. A popular approach proposed by many researchers is to retain a small amount of DRAM to buffer frequently accessed data [3], [4]. Additionally, the entire main memory and CPU must fit on a single die. Integrating the two in a 2D planar fashion makes the available area a limiting factor, necessitating 3D integration of the memory and CPU logic. This requires that the memory access circuitry be accounted for in the CPU layout, possibly increasing design complexity.

Previous work has looked at integrating general CPU logic under crosspoint arrays [5]. That study found that placing and routing the CPU required an additional 18% of area on top of the area requirements of the memory access circuits, with only half the die occupied by ReRAM arrays. The irregular structure of the cores is not a good fit for this kind of integration; regular structures such as caches do much better.

The design we propose is a modular implementation of ReRAM over a last level cache.
This could be tiled across the center of the die, with the edges containing CPU cores, DRAM controllers, or even accelerators, depending on the requirements. An example floorplan of such a system is shown in Fig. 1.

The centrally located ReRAM block in Fig. 1 is designed to act as a single embedded memory IP with a separate internal NoC. In addition to the NoC router circuits, the area underneath the ReRAM array could be used for SRAM arrays that act as the last level cache. Designers only need to consider the number of memory banks, the size of the interconnect, and its topology. Once an optimal configuration is selected, we can generate the overall ReRAM embedded block design and the external interface block. This block can then be used to generate a floorplan layout and to provide area estimates that guide the placement of individual components in such a system.

For the repeating block, the ReRAM arrays will be laid out in 4 groups of 4, creating a tiled cross as shown in Fig. 3. The memory controller, bank controller, and NoC router are placed in the free area beneath the central ReRAM arrays; SRAM cache can be placed beneath the surrounding arrays.
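
The repeating block described above can be sketched as a small floorplan model. This is a minimal illustration, not the authors' generator: it assumes the 4 groups of 4 arrays form a 4 x 4 grid of sub-arrays, and that the "central" arrays are the inner 2 x 2 sites where the four groups meet. The names `TileConfig`, `tile_floorplan`, and `site_counts`, and the labels `CTRL` and `SRAM`, are hypothetical and do not come from the paper.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TileConfig:
    """Hypothetical description of one repeating ReRAM block."""
    groups_per_side: int = 2         # 2 x 2 = 4 groups, as stated in the text
    arrays_per_group_side: int = 2   # 2 x 2 = 4 arrays per group, as stated in the text


def tile_floorplan(cfg: TileConfig) -> list[list[str]]:
    """Label what sits beneath each ReRAM sub-array in one tile.

    Assumption: the central 2 x 2 sites host the memory controller,
    bank controller, and NoC router ("CTRL"); every other site hosts
    SRAM cache ("SRAM").
    """
    side = cfg.groups_per_side * cfg.arrays_per_group_side
    lo, hi = side // 2 - 1, side // 2 + 1  # bounds of the central 2 x 2 region
    return [
        ["CTRL" if lo <= r < hi and lo <= c < hi else "SRAM" for c in range(side)]
        for r in range(side)
    ]


def site_counts(cfg: TileConfig, num_tiles: int) -> dict[str, int]:
    """Count CTRL and SRAM sites when the tile is repeated num_tiles times."""
    counts = Counter(site for row in tile_floorplan(cfg) for site in row)
    return {kind: n * num_tiles for kind, n in counts.items()}


if __name__ == "__main__":
    cfg = TileConfig()
    for row in tile_floorplan(cfg):
        print(" ".join(row))
    print(site_counts(cfg, num_tiles=4))
```

Under these assumptions, each tile offers 4 sites for control and routing logic and 12 sites for SRAM, so the cache area beneath the arrays scales linearly with the number of tiles placed across the center of the die.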