Stack filter: Reducing L1 data cache power consumption

R. Gonzalez-Alberquilla, F. Castro*, L. Pinuel, F. Tirado
ArTeCS Group, University Complutense of Madrid, Spain

Article history: Received 29 December 2009; Received in revised form 15 September 2010; Accepted 1 October 2010; Available online 25 October 2010

Keywords: Power-performance efficient design; Memory hierarchy; Cache memory

Abstract

The L1 data cache is one of the most frequently accessed structures in the processor. Because of this, and because of its moderate size, it is a major consumer of power. To reduce its power consumption, this paper proposes a small filter structure that exploits the special features of references to the stack region. This filter, which acts as a top, non-inclusive level of the data memory hierarchy, consists of a register set that keeps the data stored in the neighborhood of the top of the stack. Our simulation results show that with a small Stack Filter (SF) of just a few registers, 10-25% data cache power savings can be achieved on average, with a negligible performance penalty.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

Continuous technical improvements in the field of microprocessors lead the trend toward more sophisticated chips. Nevertheless, this comes at the expense of a significant increase in power consumption, and it is well known among architects that the main goal in current designs is to deliver both high performance and low power consumption simultaneously. This is why many researchers have focused their efforts on reducing overall power dissipation. Power dissipation is spread across different structures, including caches, register files, the branch predictor, etc. However, on-chip caches can by themselves consume over 40% of a chip's overall power [1,2].
One alternative to mitigate this effect is to partition caches into several smaller caches [3-5], with the implied reduction in both access time and power cost per access. Another design, known as filter cache [6], trades performance for power consumption by filtering cache references through an unusually small L1 cache. An L2 cache, similar in size and structure to a typical L1 cache, is placed after the filter cache to minimize the performance loss. A different alternative, named selective cache ways [7], provides the ability to disable a subset of the ways in a set-associative cache during periods of modest cache activity, whereas the full cache remains operational for more cache-intensive periods. Loop caches [8] are another proposal to save power, consisting of a direct-mapped data array and a loop cache controller. The loop cache controller knows precisely, well ahead of time, whether the next data-requesting instruction will hit in the loop cache; as a result, there is no performance degradation. Yet another approach takes advantage of the special behavior of memory references: the conventional unified data cache is replaced with multiple specialized caches, each handling a different kind of memory reference according to its particular locality characteristics. Examples of this approach are [9,10], both of which exploit the locality exhibited by stack accesses. These alternatives mainly provide improvements in terms of performance. It is important to highlight that all of these approaches are just some of the existing proposals in the field of cache design.

In this paper we propose a different approach that also exploits the special features of stack references.
The novelty resides in the fact that we do not employ a specialized cache for handling stack accesses; instead, we use a straightforward, small-sized structure that records a few words in the neighborhood of the stack pointer and acts as a filter: if the referenced data falls within the range stored in this filter, we avoid an unnecessary access to the L1 data cache. Otherwise, we perform the access as in conventional designs. This way, although the IPC remains largely unchanged, we are able to significantly reduce the power consumption of the critical data cache structure with negligible extra hardware. We target a high-performance embedded processor [11] as the platform to evaluate our proposal, but the technique is likewise applicable to CMPs.

This work is organized as follows: Section 2 describes the special characteristics of stack references; Section 3 explains our proposed filter and its implementation details; Section 4 describes the setup used to evaluate our proposal; Section 5 presents and analyzes the experimental results; Section 6 discusses related work; and, finally, Section 7 concludes.

1383-7621/$ - see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.sysarc.2010.10.002

* Corresponding author. E-mail addresses: rgalberquilla@pdi.ucm.es (R. Gonzalez-Alberquilla), fcastror@pdi.ucm.es, fcastror@fis.ucm.es (F. Castro), lpinuel@pdi.ucm.es (L. Pinuel), ptirado@pdi.ucm.es (F. Tirado).

Journal of Systems Architecture 56 (2010) 685-695
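The filtering behavior described in the introduction can be approximated with a small behavioral model. The sketch below is purely illustrative and not the authors' implementation: all names (`StackFilter`, `num_regs`, `move_sp`) are assumptions, and the model only counts which accesses a window of registers above the stack pointer could serve, thereby avoiding L1 data cache accesses; real hardware details such as write-back of evicted words are only noted in comments.

```python
# Illustrative behavioral model of a stack filter (hypothetical names,
# not the paper's implementation). The filter covers `num_regs` words
# starting at the current stack pointer; accesses inside that window
# are "hits" that would skip the L1 data cache.

class StackFilter:
    def __init__(self, num_regs=8, word=4):
        self.num_regs = num_regs   # filter size in registers (words)
        self.word = word           # word size in bytes (assumed 4)
        self.sp = None             # current stack pointer
        self.regs = {}             # address -> value held by the filter
        self.hits = 0
        self.misses = 0

    def covers(self, addr):
        """True if addr lies in the window kept by the filter."""
        return (self.sp is not None and
                self.sp <= addr < self.sp + self.num_regs * self.word)

    def move_sp(self, sp):
        """On a stack-pointer update, re-center the window.
        In hardware, words falling out of range would be written back
        to the L1 data cache; here we simply drop them."""
        self.sp = sp
        self.regs = {a: v for a, v in self.regs.items() if self.covers(a)}

    def store(self, addr, value):
        """Returns True on a filter hit (L1 access avoided)."""
        if self.covers(addr):
            self.regs[addr] = value
            self.hits += 1
            return True
        self.misses += 1           # falls through to a conventional L1 store
        return False

    def load(self, addr):
        """Returns (hit, value); on a miss the L1 cache would serve it."""
        if self.covers(addr) and addr in self.regs:
            self.hits += 1
            return True, self.regs[addr]
        self.misses += 1
        return False, None
```

As a usage example, after `move_sp(0x1000)` a 4-register filter covers `[0x1000, 0x1010)`, so an access to `0x1004` hits in the filter while one to `0x2000` falls through to the L1 data cache; the hit/miss counters give a rough proxy for the fraction of L1 accesses the filter could remove.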