Reducing Cache Misses by Application-Specific Re-Configurable Indexing

K. Patel, L. Benini, E. Macii, M. Poncino
Università di Bologna, Bologna, Italy 40136
Politecnico di Torino, Torino, Italy 10129
Università di Verona, Verona, Italy 37134

ABSTRACT

The predictability of memory access patterns in embedded systems can be successfully exploited to devise effective application-specific cache optimizations. In this work, we propose an improved indexing scheme for direct-mapped caches, which drastically reduces the number of conflict misses by using application-specific information; the scheme is based on the selection of a subset of the address bits. With respect to similar approaches, our solution has two main strengths. First, it models the misses analytically by building a miss equation, and exploits a symbolic algorithm to compute the exact optimum solution (i.e., the subset of address bits to be used as cache index that minimizes conflict misses). Second, we designed a re-configurable bit selector, which can be programmed at run-time to fit the optimal cache indexing to a given application. Results show an average reduction of conflict misses of 24%, measured over a set of standard benchmarks and for different cache configurations.

1. Introduction

Embedded systems offer additional opportunities for performance or energy optimization with respect to their general-purpose counterparts, since their execution context (a well-defined application mix) results in a higher predictability of execution and memory access patterns. Such predictability enables powerful application-specific optimizations that have been particularly successful in the design of the memory hierarchy (see [1, 2] for a comprehensive survey).

Modern embedded processors have extensive support for hierarchical memory organization; for this reason, recent works have started to address the issue of application-specific cache optimizations ([4]–[6]).
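The conflict misses targeted by such optimizations can be made concrete with a small simulation. The sketch below (illustrative only, not from the paper) counts misses in a direct-mapped cache indexed with the least significant address bits; two addresses that differ only above the index bits evict each other on every access.

```python
# Illustrative sketch: miss counting in a direct-mapped cache that is
# indexed with the least significant address bits (conventional indexing).

def count_misses(trace, index_bits, offset_bits=0):
    """Simulate a direct-mapped cache with 2**index_bits lines.

    `trace` is a list of integer addresses; block-offset bits (if any)
    are stripped before indexing.
    """
    lines = [None] * (1 << index_bits)   # one stored tag per cache line
    misses = 0
    for addr in trace:
        line_addr = addr >> offset_bits
        index = line_addr & ((1 << index_bits) - 1)  # LSB selection
        tag = line_addr >> index_bits
        if lines[index] != tag:          # miss: line empty or evicted
            misses += 1
            lines[index] = tag
    return misses

# 0x000 and 0x100 differ only in bit 8; with 4 index bits they both
# map to line 0 and keep evicting each other:
trace = [0x000, 0x100, 0x000, 0x100]
print(count_misses(trace, index_bits=4))  # → 4 (one cold, three conflict)
```

With a larger index (e.g., 9 bits) the same trace incurs only the two cold misses, which is exactly the kind of gap a smarter index-bit choice can close without enlarging the cache.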
In this work, we propose an application-specific cache optimization which targets the reduction of conflict misses. We focus on direct-mapped caches, which are known to provide smaller access times, have a simple implementation, and are conventionally used as L1 caches, especially in low-end processors. In direct-mapped caches, cache lines are conventionally addressed by using the least significant bits of the memory address. For some access patterns, this indexing mechanism may incur a large number of conflict misses. Our solution improves the basic indexing scheme by exploiting the possibility of profile-driven optimization. The proposed indexing is based on the selection of a subset of the bits of the memory address which minimizes conflict misses on a given address trace.

The main contributions of our work are in two areas. First, we model conflict misses with a Boolean condition (which analytically expresses the total number of conflict misses of a trace), and we propose a novel symbolic algorithm (based on BDDs and ADDs) which yields an exact solution to the optimal bit selection problem. Second, we propose an architectural extension, namely a programmable index bit selector which enables run-time adaptation of the index bit selection to different applications or application mixes. We designed and optimized the layout of the selector, and we show that it imposes minimal overhead on cache access time and power. Results on a set of embedded applications have shown an average reduction of conflict misses of 24% (100% maximum, i.e., a reduction to zero misses) with respect to a conventional indexing that uses the least significant bits.

The paper is organized as follows. Section 2 reviews previous works on advanced cache indexing schemes. Section 3 describes the proposed indexing algorithm. Section 4 illustrates the architecture of the programmable selector. Simulation results are provided in Section 5.
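The functional idea behind the programmable bit selector can be sketched as follows. This is a hypothetical software model, not the paper's hardware or its BDD/ADD optimization algorithm: the index is assembled from an arbitrary subset of address-bit positions rather than from the least significant bits.

```python
# Hypothetical model of index-bit selection: the cache index is packed
# from a chosen subset of address bits (bit positions are illustrative).

def select_index(addr, bit_positions):
    """Pack the address bits at `bit_positions` into a cache index."""
    index = 0
    for i, pos in enumerate(bit_positions):
        index |= ((addr >> pos) & 1) << i
    return index

# With conventional LSB indexing (bits 0..3) these addresses conflict ...
a, b = 0x0A0, 0x1A0
assert select_index(a, [0, 1, 2, 3]) == select_index(b, [0, 1, 2, 3])
# ... but swapping bit 3 for bit 8 in the index separates them:
assert select_index(a, [0, 1, 2, 8]) != select_index(b, [0, 1, 2, 8])
```

A profile-driven optimizer, as in the paper, would choose `bit_positions` to minimize conflicts over a recorded address trace; the re-configurable selector then lets different applications load different bit choices at run-time.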
Finally, Section 6 draws some concluding remarks.

2. Background and Previous Work

The problem of optimal cache indexing has strong analogies with optimal hashing. In fact, indexing can be viewed as mapping a set of 2^n items (the space of all possible n-bit addresses, i.e., the key space in hashing terminology) into a much smaller space of 2^m items, m < n (the number of cache entries, i.e., the address space), with the objective of minimizing the number of conflicts (collisions), that is, distinct items mapping to the same entry. This task is accomplished by means of a hash function H, which maps a generic address X in the key space to a value Y = H(X) in the address space. There exists a vast theory of hash functions, with strong optimality results for perfect hashing [3], but such advanced hashing techniques are not suitable for hardware implementation.

Conventional cache indexing implements the traditional "modulo"-based hashing, in which a key X is mapped to an address Y = X mod 2^m; this scheme guarantees a reasonable distribution of the keys over the address space, and lends itself to a minimum-cost hardware implementation, since the modulo operation amounts to selecting the least significant bits of the address.

Several alternative hashing schemes have also been explored for cache indexing. Exclusive-OR (XOR) based hashing functions are a quite popular solution [7, 8, 9]. Index bits are obtained by XOR-ing a given number of address bits; the general idea can be extended to Boolean operators other than

0-7803-8702-3/04/$20.00 ©2004 IEEE.
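The two hashing schemes discussed above can be sketched side by side. This is an assumed illustration, not the paper's hardware: "modulo" indexing keeps the m least significant bits (Y = X mod 2^m), while a simple XOR-based hash folds the next m bits onto them.

```python
# Sketch of two hardware-friendly cache index hash functions.

def modulo_index(addr, m):
    """Conventional indexing: Y = X mod 2^m, i.e., keep the m LSBs."""
    return addr & ((1 << m) - 1)

def xor_index(addr, m):
    """XOR-based indexing: fold the next m address bits onto the m LSBs."""
    return modulo_index(addr, m) ^ ((addr >> m) & ((1 << m) - 1))

# A power-of-two stride defeats modulo indexing (all addresses collide),
# while the XOR hash spreads the same addresses over distinct lines:
addrs = [i << 4 for i in range(4)]           # stride 16, with m = 4
print([modulo_index(a, 4) for a in addrs])   # → [0, 0, 0, 0]
print([xor_index(a, 4) for a in addrs])      # → [0, 1, 2, 3]
```

XOR-ing costs only one gate level in front of the decoder, which is why this family of hash functions is popular for hardware indexing; the bit-selection scheme of this paper is cheaper still, amounting to a programmable multiplexer per index bit.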