Active Measurement of Memory Resource Consumption

Marc Casas
Barcelona Supercomputing Center
Jordi Girona 29, Nexus II Building, 08034 Barcelona

Greg Bronevetsky
Lawrence Livermore National Laboratory
7000 East Avenue, Livermore, CA 94550

Abstract— Hierarchical memory is a cornerstone of modern hardware design because it provides high memory performance and capacity at a low cost. However, the use of multiple levels of memory and complex cache management policies makes it very difficult to optimize the performance of applications running on hierarchical memories. As the number of compute cores per chip continues to rise faster than the total amount of available memory, applications will become increasingly starved for memory storage capacity and bandwidth, making the problem of performance optimization even more critical. We propose a new methodology for measuring and modeling the performance of hierarchical memories in terms of the application's utilization of the key memory resources: capacity of a given memory level and bandwidth between two levels. This is done by actively interfering with the application's use of these resources. The application's sensitivity to reduced resource availability is measured by observing the effect of interference on application performance. The resulting resource-oriented model of performance both greatly simplifies application performance analysis and makes it possible to predict an application's performance when running with various resource constraints. This is useful for predicting performance on future memory-constrained architectures.

I. INTRODUCTION

Hierarchical memory (registers, caches and main memory) is a critical driver of modern systems' high performance because it combines small amounts of fast but expensive memory with large amounts of slower, cheaper memory to provide an excellent balance of low cost, high performance and high capacity.
However, its complexity makes it very difficult to achieve high performance and energy efficiency for real applications, a problem that has motivated significant research on cache-friendly algorithms [8], [5] and performance analysis tools to simplify this task [19], [16], [12]. Unfortunately, even after decades of work, the goal of easy-to-use memory optimization techniques is still far from reach.

Modern architectural designs provide increasing improvements in computation capability while maintaining a constant power utilization by increasing the number of cores on each chip. Since the power efficiency and cost efficiency of memory designs are not improving at the same rate, the amount of memory per compute core is dropping [13]. This is especially true for High Performance Computing (HPC) systems, where hard limits on power costs mean that next-generation Exascale systems may provide one or two orders of magnitude less memory capacity and bandwidth per core than today's systems [13].

[Footnote: The research leading to these results has received funding from the European Research Council under the European Union's 7th FP (FP/2007-2013) / ERC GA n. 321253. This article has been authored in part by Lawrence Livermore National Security, LLC under Contract DE-AC52-07NA27344 with the U.S. Department of Energy. Accordingly, the United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this article or allow others to do so, for United States Government purposes. This work was partially supported by the Department of Energy Office of Science (Advanced Scientific Computing Research) Early Career Grant, award number NA27344.]
These limitations will force application designers to fundamentally rethink how their algorithms utilize the memory system and will make effective memory optimization methodologies critical for maintaining application performance on future systems.

Ensuring that applications use the memory hierarchy optimally, or restructuring algorithms to leverage hierarchies that are deeper (more levels) and thinner (fewer resources per core), requires a detailed analysis of how an application uses memory. Although there exists a wide range of tools to help with this task, they have key limitations. Simulation-based tools such as cachegrind [17] and gem5 [3] can analyze the application's behavior in great detail and can predict the performance of any collection of applications running on any hardware configuration. However, such tools run hundreds or thousands of times slower than native execution and cannot simulate the commercial architectures on which almost all applications run, because simulator developers have no access to their proprietary details. These limitations have motivated work on tools based on monitoring hardware performance counters. These tools report metrics such as cache miss rates or instructions per cycle for various code regions [19], conduct complex statistical analyses of such counter data [12], or connect counters to other aspects of the application, such as data structures [16]. Although these tools are efficient and precisely capture the state of the hardware and how the application utilizes it, this information is not actionable in most cases. First, the metrics reported by these tools are so low-level that they can only be interpreted by the most hardware-savvy developers. Further, this information is not useful for predicting how the application may behave in alternate scenarios, such as