APC: A Novel Memory Metric and Measurement Methodology for Modern Memory Systems

Dawei Wang, Member, IEEE, and Xian-He Sun, Fellow, IEEE

Abstract—Due to the infamous "memory wall" problem and a drastic increase in the number of data-intensive applications, memory rather than processors has become the leading performance bottleneck in modern computing systems. Evaluating and understanding memory system performance is increasingly becoming the core of high-end computing. Conventional memory metrics, such as miss ratio, AMAT, etc., are designed to measure a given memory performance parameter and do not reflect the overall performance or complexity of a modern memory system. On the other hand, widely used system-performance metrics, such as IPC, are designed to measure CPU performance and do not directly reflect memory performance. In this paper, we propose a novel memory metric called Access Per Cycle (APC), which is the number of data accesses per cycle, to measure the overall memory performance with respect to the complexity of modern memory systems. A unique contribution of APC is its separation of memory evaluation from CPU evaluation; therefore, it provides a quantitative measurement of the "data-intensiveness" of an application. Simulation results show that the memory performance measured by APC captures the concurrency complexity of modern memory systems, while other metrics cannot. APC is simple, effective, and significantly more appropriate than existing memory metrics in evaluating modern memory systems.

Index Terms—Memory performance measurement, memory metric, measurement methodology

1 INTRODUCTION

THE rapid advances of semiconductor technology have driven large increases in processor performance over the past thirty years. However, memory performance has not experienced such dramatic gains as processors have; this leaves memory performance lagging far behind CPU performance.
This growing performance gap between processor and memory is referred to as the "memory wall" [1], [2]. The "memory wall" problem is experienced not only in main memory but also in on-die caches. For example, in the Intel Nehalem architecture, each L1 data cache has a four-cycle hit latency, and each L2 cache has a 10-cycle hit latency [3]. Similarly, the IBM POWER6 has a four-cycle L1 cache hit latency and a 24-cycle L2 cache hit latency [4]. This large performance gap between the processor and the memory hierarchy makes memory access the dominant performance factor in high-end computing.

Much recent research tries to improve the performance of memory systems. However, understanding the performance of modern hierarchical memory systems remains elusive for many researchers and practitioners. While memory (for the remainder of this paper, "memory" refers to the entire memory hierarchy) is the performance bottleneck, how to measure and evaluate memory systems has become an important issue facing the high-performance computing community.

Conventional performance metrics, such as IPC (Instructions Per Cycle) and Flops (floating-point operations per second), are designed from a computing-centric point of view. As such, they are comprehensive but are affected by instruction sets, CPU micro-architecture, memory hierarchy, and compiler technology, and cannot be applied directly to measure the performance of a memory system. On the other hand, existing memory performance metrics, such as miss rate, bandwidth, and average memory access time (AMAT), are designed to measure a particular component of a memory system or the performance of a single access to the memory system. They are useful for optimizing and evaluating a given component, but cannot accurately characterize the performance of the memory system as a whole. In general, a component improvement does not necessarily lead to an improvement in overall performance.
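To make the contrast between these metric families concrete, the following sketch computes miss rate, AMAT, and APC from a set of hypothetical performance counters. All counter names and numbers here are our own illustration, not results from this paper's experiments; APC is simply total memory accesses divided by cycles, as defined above.

```python
# Hypothetical counters from one simulated program run (illustrative values only).
total_cycles = 1_000_000   # wall-clock cycles of the run
mem_accesses = 250_000     # loads + stores issued to the memory hierarchy
l1_misses    = 25_000      # accesses that missed the L1 data cache
l1_hit_time  = 4           # cycles (a Nehalem-class L1 hit latency)
miss_penalty = 40          # average additional cycles per L1 miss

# Conventional per-component / per-access metrics.
miss_rate = l1_misses / mem_accesses                 # fraction of accesses missing L1
amat      = l1_hit_time + miss_rate * miss_penalty   # cycles for one access in isolation

# APC: accesses completed per cycle, an overall-throughput view.
apc = mem_accesses / total_cycles

print(f"miss rate = {miss_rate:.2f}")   # 0.10
print(f"AMAT      = {amat:.1f} cycles") # 8.0
print(f"APC       = {apc:.2f}")         # 0.25
```

Note that AMAT models a single access in isolation, while APC divides by elapsed cycles, so any overlap among outstanding accesses raises APC without changing AMAT at all.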
For instance, when the miss rate decreases, IPC may not increase, and sometimes IPC will even decrease (see Section 4.2 for details). When non-blocking caches are used, the AMAT metric shows a negative effect on IPC (see Section 4.2.3 for details). Since there is no known correlation study between existing memory metrics and final system performance, a frequent and common question from practitioners is whether a component improvement actually leads to a system improvement. Therefore, an appropriate metric for measuring memory systems is critically needed to analyze system designs and performance enhancements.

There are several reasons why traditional memory performance metrics cannot characterize the overall performance of a memory system. First, modern CPUs exploit several ILP (Instruction-Level Parallelism) technologies to overlap ALU instruction execution and memory accesses. Out-of-order execution overlaps CPU execution time and memory access delay, allowing an application to hide the miss penalty of an L1 data cache miss that hits in the L2 cache. Multithreading technologies, such as SMT [5] or fine-grained multithreading [6], can tolerate even longer misses through main memory by

D. Wang is with the Department of Computer Science, Illinois Institute of Technology, Chicago, IL 60616. E-mail: david.albert.wang@gmail.com.
X.-H. Sun is with the Department of Computer Science, Illinois Institute of Technology, Chicago, IL 60616. E-mail: sun@iit.edu.

Manuscript received 7 Dec. 2011; revised 21 Dec. 2012; accepted 4 Feb. 2013. Date of publication 24 Feb. 2013; date of current version 27 June 2014. Recommended for acceptance by E. Miller. Digital Object Identifier no. 10.1109/TC.2013.38

1626 IEEE TRANSACTIONS ON COMPUTERS, VOL. 63, NO. 7, JULY 2014
0018-9340 © 2013 IEEE.
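The overlap argument above can be illustrated with a toy back-of-the-envelope model (our own sketch, not from this paper's simulations): a blocking cache services misses one at a time, while a non-blocking cache keeps several misses in flight, so the same workload finishes in fewer cycles even though each individual miss still takes the same latency.

```python
import math

def cycles_blocking(n_misses: int, latency: int) -> int:
    """Blocking cache: misses are serviced strictly one after another."""
    return n_misses * latency

def cycles_overlapped(n_misses: int, latency: int, max_outstanding: int) -> int:
    """Non-blocking cache: up to max_outstanding misses overlap in 'waves'.
    A deliberately simple model; a real simulator would track queues and banks."""
    waves = math.ceil(n_misses / max_outstanding)
    return waves * latency

n, lat = 8, 100                              # 8 misses, 100 cycles each (illustrative)
t_blk = cycles_blocking(n, lat)              # 8 * 100 = 800 cycles
t_nb  = cycles_overlapped(n, lat, 4)         # ceil(8/4) * 100 = 200 cycles

# The per-access latency (the AMAT view) is 100 cycles in BOTH cases,
# but an accesses-per-cycle view distinguishes them:
apc_blk = n / t_blk                          # 0.01
apc_nb  = n / t_nb                           # 0.04, 4x higher under overlap
```

Under this model, quadrupling the number of overlapped misses quadruples APC while leaving AMAT untouched, which is exactly why a per-access metric cannot capture the concurrency of a non-blocking memory system.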