1138 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 34, NO. 7, JULY 2015

FALPEM: Framework for Architectural-Level Power Estimation and Optimization for Large Memory Sub-Systems

Amit Chhabra, Harsh Rawat, Mohit Jain, Pascal Tessier, Daniel Pierredon, Laurent Bergher, and Promod Kumar

Abstract: A framework is developed for estimating power at the pre-register-transfer-level (pre-RTL) stage for structured memory sub-systems. A power estimation model is proposed that specifically targets the power consumed by the clock network and interconnect. The model is validated against VCD-based simulation on the back-annotated netlist of an 8 MB memory sub-system used as video RAM (VRAM) for high-end graphics applications. The methodology also forms the basis for low-power exploration, driving the choice of floorplan and the gating structure of the data and clock networks. We demonstrate a 57% reduction in dynamic power by applying low-power techniques to the 8 MB VRAM used as a frame buffer in a graphics processor. FALPEM can be extended to other applications such as processor caches and ASIC designs.

Index Terms: Analytical model, clock network, FD-SOI, low power, memory sub-system, power estimation, power modeling, random access memory (RAM), video RAM (VRAM).

I. INTRODUCTION

Advanced computing and graphics applications are pushing the need for large on-chip storage capacity combined with high throughput. Although shrinking technology nodes have enabled the integration of static random access memory (SRAM) capacity on the order of hundreds of Mbits, meeting GHz performance is possible by building the memory sub-system as an assembly of multiple small SRAM instances with pipeline stages and associated logic. The power consumed by such a large memory sub-system forms a significant portion of the overall power budget. As a result, it is crucial that power-performance trade-offs become more visible to the chip architects at the initial design stage.
While it is relatively easy to estimate the performance of a memory sub-system, estimating power is much more complex, since it comprises dissipation inside the combinatorial logic, clock distribution network, etc.

Manuscript received July 10, 2014; accepted November 24, 2014. Date of publication January 6, 2015; date of current version June 16, 2015. This paper was recommended by Associate Editor T. Mitra.
A. Chhabra, H. Rawat, and P. Kumar are with STMicroelectronics, Greater Noida 201308, India (e-mail: amit.chhabra@st.com).
M. Jain is with STMicroelectronics, Scottsdale, AZ 85254 USA.
P. Tessier, D. Pierredon, and L. Bergher are with STMicroelectronics, Crolles 38920, France.
Digital Object Identifier 10.1109/TCAD.2014.2387859

Quantitative evaluation of memory sub-systems through simulator-based formal methods [1], [2] has been of little use due to its much higher level of abstraction and heavy dependence on exhaustive simulations. Lower-level power tools such as PowerMill [3] and QuickPower [4], which operate at the circuit and Verilog levels, provide excellent accuracy but require prior knowledge of the design and consume large simulation time. A methodology for power estimation of custom blocks such as SRAM, logic units such as instruction selection logic, and the clock network is presented in [5]–[7], but it depends on vector-based simulations to consider use-case scenarios. Methods to estimate the power of memory instances are developed in [5] and [8]. A lot of research on providing a complete framework for analyzing memory system organizations is being done using Cache Access and Cycle Time Information (CACTI) [9], [10], but that work is targeted at computer architects who want to better understand the performance tradeoffs in building cache organizations and memory controllers. Enhancements to CACTI that are specific to nonvolatile RAM have been proposed in NVSIM [11].
Architectural-level proposals for a video-active RAM are presented in [12], but they focus on optimizing power for video coding applications.

In this paper, we propose FALPEM, a framework that can be used for power estimation and optimization during the implementation phase of a memory sub-system, after the memory organization and basic functionality are frozen. However, it can be used even before the register transfer level (RTL) description of the memory sub-system is conceived. This simple and reasonably accurate method for quantitative estimation of the power consumed by a memory sub-system is useful in two ways: 1) it provides an estimated power value that can be readily compared with the given power budget and 2) it provides crucial information for architectural-level decisions such as memory clustering scheme, data path partitioning, pipeline depth, and data and clock gating granularity. FALPEM can be used for multiple applications such as frame buffers for graphics, L1 and L2 caches for processors, and ASICs in general. We have validated FALPEM on an 8 MB memory sub-system with five pipeline stages used for frame buffer applications in a graphics processor.

The rest of this paper is organized as follows. In Section II, we discuss our power estimation model and the methodology for estimating the different components of total sub-system power, especially the clock network power. In Section III, we demonstrate the validity of our model by comparing the estimated values with simulation figures on two test cases. In Section IV, we discuss a case study of the implementation of an 8 MB video-RAM macro designed in 28 nm ultra thin body and box, fully depleted silicon on insulator (UTBB FD-SOI) CMOS technology. We also discuss our observations from the model and how we used it to make several design decisions.

II. POWER ESTIMATION MODEL

The memory sub-system comprises numerous base units of SRAM instances, along with pipeline stages at its interfaces to meet the performance target. Although we assume that the base SRAM units are arranged in a matrix-like structure with nY rows and nX columns, our model can accommodate some deviations from this arrangement. We estimate the dynamic power by splitting it into four components: base SRAM instances, clock network, interconnect wires and buffers, and internal power of the flip-flops. These components depend heavily on the floorplan, choice of metal layers, number of pipeline stages, etc.

Fig. 1 depicts a sample floorplan of a 2 MB block for the purpose of model description. This arrangement has eight rows (nY) and four columns (nX) of base instances, with SRAM outputs registered alongside to ensure better performance. Four such blocks constitute the 8 MB video-RAM memory sub-system. The registers that interface with the SoC are present at the bottom. In this section, we propose our methodology to estimate each component.

0278-0070 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

A. Memory Internal Power

Unlike CACTI, where Wilton and Jouppi [9] proposed methods to estimate base SRAM instance power, we have assumed the