A Table-based Method for Single-Pass Cache Optimization

Pablo Viana, Federal University of Alagoas, Arapiraca-AL, Brazil, pablo@lccv.ufal.br
Ann Gordon-Ross, University of Florida, Gainesville-FL, USA, ann@ece.ufl.edu
Edna Barros, Federal University of Pernambuco, Recife-PE, Brazil, ensb@cin.ufpe.br
Frank Vahid, University of California, Riverside, Riverside-CA, USA, vahid@cs.ucr.edu

ABSTRACT
Due to the large contribution of the memory subsystem to total system power, the memory subsystem is highly amenable to customization for reduced power/energy and/or improved performance. Cache parameters such as total size, line size, and associativity can be specialized to the needs of an application for system optimization. To determine the best values for cache parameters, most methodologies utilize repetitious application execution to individually analyze each configuration explored. In this paper we propose a simplified yet efficient technique to accurately estimate the miss rate of many different cache configurations in just one single pass of execution. The approach utilizes simple data structures in the form of a multi-layered table and elementary bitwise operations to capture the locality characteristics of an application’s addressing behavior. The proposed technique intends to ease miss rate estimation and reduce cache exploration time.

Categories and Subject Descriptors
B.3 [Memory Structures]: Performance Analysis and Design Aids

General Terms
Algorithms

Keywords
Configurable cache tuning, cache optimization, low energy.

1. INTRODUCTION
Optimization of system performance and power/energy consumption is an important step during system design and is accomplished through specialization, or tuning, of the system. Tunable parameters include supply voltage, clock speed, bus width, encoding schemes, etc.
Of the many tunable parameters, it is well known that one of the main bottlenecks for system efficiency resides in the memory subsystem (all levels of cache, main memory, buses, etc.) [16]. The memory subsystem can account for as much as 50% of total system power [1, 17].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
GLSVLSI’08, May 4–6, 2008, Orlando, Florida, USA.
Copyright 2008 ACM 978-1-59593-999-9/08/05 ...$5.00.

Memory subsystem parameters such as total size, line size, and associativity can be tuned to an application’s temporal and spatial locality to determine the best cache configuration to meet optimization goals [3]. However, the effectiveness of such tuning depends on the ability to determine the best cache configuration to complement an application’s memory addressing behavior.

To determine a cache size that yields good performance and low energy for an application, the size must closely reflect the application’s temporal locality needs. It is important to determine how frequently memory addresses are accessed and how long an executing application takes to access the same memory reference again. This property is mostly attributed to working-set characteristics such as loop size.

Similarly, the cache line size must closely reflect the spatial locality of an application, which is present in straight-line instruction code and data array accesses. Additionally, the associativity must closely reflect the needs of the application.
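To make the role of these three parameters concrete, the following Python sketch shows how a memory address decomposes into tag, set index, and block offset for a given configuration, using the kind of elementary bitwise operations the abstract alludes to. This is our own illustration, not code from the paper, and it assumes power-of-two sizes; all names are ours.

```python
def decompose(addr, cache_bytes, line_bytes, assoc):
    """Split an address into (tag, set index, block offset) for a cache
    of the given total size, line size, and associativity.
    Illustrative sketch only; assumes all parameters are powers of two."""
    # Number of sets = total size / (line size * associativity).
    num_sets = cache_bytes // (line_bytes * assoc)
    offset_bits = line_bytes.bit_length() - 1        # log2(line_bytes)
    index_bits = num_sets.bit_length() - 1           # log2(num_sets)
    offset = addr & (line_bytes - 1)                 # byte within the line
    index = (addr >> offset_bits) & (num_sets - 1)   # which set
    tag = addr >> (offset_bits + index_bits)         # remaining high bits
    return tag, index, offset
```

For example, in an 8 KB, 2-way cache with 32-byte lines, address 0x1234 maps to tag 1, set 17, byte offset 20; changing any one parameter redraws these bit boundaries, which is why each parameter interacts differently with an application's locality.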
To determine the best values for these tunable parameters, or the best cache configuration, existing cache evaluation techniques include analytical modeling [6, 10] and execution-based evaluation [4] to evaluate the design space. Analytical models evaluate code characteristics and designer annotations to predict an appropriate cache configuration in a very short amount of time, requiring little designer effort. Although this method can be accurate, it can be difficult to predict how an application will respond to real-world input stimuli.

A more precise technique is execution-based evaluation. In this technique, an application is typically simulated multiple times, and through the use of a cache simulator, application performance and/or energy are evaluated for each cache configuration explored. Whereas this technique is more accurate than an analytical model, modern embedded systems are becoming increasingly complex, and simulating these applications for numerous cache configurations can demand a large amount of design time.

To accelerate execution-based evaluation, specialized caches have been designed that allow cache parameters to be varied at runtime [2, 14, 19]. However, due to the intrusive nature of the exploration heuristics, the cache must be physically changed to explore each configuration. Exploring a large number of cache configurations can have a potentially significant adverse effect on program execution in terms of energy and performance overhead while exploring poor configurations.

To reduce the number of configurations explored, efficient heuristics have been proposed [8, 19] that systematically traverse the configuration space and arrive at a near-optimal cache configuration while evaluating only a fraction of the design space. However, even though the number of cache configurations is greatly reduced, in some systems tens of cache configurations may need to be explored, thus still potentially imposing a large overhead and con-
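The cost that motivates single-pass estimation comes from rerunning a simulation like the following once per candidate configuration. The sketch below is a minimal direct-mapped miss-rate simulator of our own construction (not the paper's method or any particular simulator); execution-based tuning would invoke it repeatedly over the same address trace, once for every (size, line size) pair explored.

```python
def miss_rate(trace, cache_bytes, line_bytes):
    """Miss rate of one direct-mapped configuration over an address trace.
    Illustrative baseline only: execution-based tuning reruns this loop
    for every candidate configuration."""
    num_sets = cache_bytes // line_bytes   # direct-mapped: one line per set
    tags = [None] * num_sets               # resident block per set
    misses = 0
    for addr in trace:
        block = addr // line_bytes         # block (line-aligned) address
        idx = block % num_sets             # set index
        if tags[idx] != block:             # not resident: miss, then fill
            tags[idx] = block
            misses += 1
    return misses / len(trace)
```

For instance, over the trace [0, 4, 8, 0, 64, 0] with a 128-byte cache and 16-byte lines, only the first touches of blocks 0 and 4 miss, giving a miss rate of 2/6. Sweeping even a modest design space means replaying the full trace tens of times, which is precisely the repetition the proposed table-based technique avoids.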