SIPT: Speculatively Indexed, Physically Tagged Caches

Tianhao Zheng, Haishan Zhu, Mattan Erez
Department of Electrical and Computer Engineering
The University of Texas at Austin
Austin, TX
Email: {thzheng, haishanz, mattan.erez}@utexas.edu

Abstract—First-level (L1) data cache access latency is critical to performance because it services the vast majority of loads and stores. To keep L1 latency low while ensuring low-complexity and simple-to-verify operation, current processors most typically utilize a virtually-indexed physically-tagged (VIPT) cache architecture. While VIPT caches decrease latency by performing cache access and address translation concurrently, each cache way is constrained by the size of a virtual page. Thus, larger L1 caches are highly associative, which degrades their access latency and energy. We propose speculatively-indexed physically-tagged (SIPT) caches to enable simultaneously larger, faster, and more efficient L1 caches. A SIPT cache speculates on the value of a few address bits beyond the page offset concurrently with address translation, maintaining the overall safe and reliable architecture of a VIPT cache while eliminating the VIPT design constraints. SIPT is a purely microarchitectural approach that can be used with any software and for all accesses. We evaluate SIPT with simulations of applications under standard Linux. SIPT improves performance by 8.1% on average and reduces total cache-hierarchy energy by 15.6%.

I. INTRODUCTION

The L1 cache is the most frequently accessed structure in the memory hierarchy and therefore should have low expected access latency. As such, the L1 cache presents challenging tradeoffs between hit rate and access latency. Access latency includes the virtual memory address translation latency (TLB lookup), tag array access and matching, and the data access itself. To push latency down, all three components are ideally overlapped.
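As a concrete illustration of which bits such a design must speculate on, the following sketch (our own; the 4 KiB page and 64 B line parameters are assumptions, not taken from the paper) counts how many set-index bits of a given L1 geometry fall above the untranslated page offset:

```python
# Sketch (illustrative, not from the paper): count the VA index bits that
# lie above the page offset and would therefore require speculation.

PAGE_SIZE = 4096   # 4 KiB pages -> VA bits [11:0] are untranslated
LINE_SIZE = 64     # 64 B cache lines -> bits [5:0] select a byte in a line

def speculated_index_bits(capacity, ways):
    """Number of set-index bits that fall beyond the page offset."""
    sets = capacity // (LINE_SIZE * ways)
    index_bits = sets.bit_length() - 1        # log2(number of sets)
    offset_bits = PAGE_SIZE.bit_length() - 1  # 12 for 4 KiB pages
    line_bits = LINE_SIZE.bit_length() - 1    # 6 for 64 B lines
    # The set index occupies VA bits [line_bits + index_bits - 1 : line_bits];
    # any of those bits at position >= offset_bits are translated bits.
    return max(0, line_bits + index_bits - offset_bits)

# A 64 KiB, 4-way L1 has 256 sets, so the index spans VA bits [13:6];
# bits 13:12 lie beyond the 4 KiB page offset and must be speculated.
print(speculated_index_bits(64 * 1024, 4))   # -> 2
```

For L1 sizes in the tens of kibibytes, only a handful of index bits fall beyond the offset, consistent with the paper's point that just a few bits need prediction.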
Tag and data accesses are overlapped by accessing all ways simultaneously and delivering only tag-matching data. Overlapping those two accesses with address translation is more challenging because an access cannot start before the address is known. The simplest cache design indeed performs translation before L1 access begins. This design is called a physically-indexed physically-tagged (PIPT) cache because virtual addresses (VAs) are not used at all in the L1. While simple, the translation overhead is not hidden and access latency is often considered too high.

Appears in HPCA 2018. © 2018 IEEE.

Current designs reduce latency and enable access and translation overlap in one of two ways. The first relies only on the offset bits of the VA for computing L1 array locations; the offset bits are not translated and can hence be used while translation proceeds. Before data is delivered, the tag is compared to the fully translated physical address (PA). This virtually-indexed physically-tagged (VIPT) design is very effective, but constrains L1 design such that the capacity of each cache way is at most the system's page size, which is commonly just 4 KiB. Thus, high associativity is necessary to attain a high hit rate, because total capacity can grow only with the number of ways. This not only increases the access latency, but also increases cache energy consumption. For example, Intel reports that 12%–45% of core power is consumed by private caches [1].
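The way-size constraint directly ties capacity to associativity. A short sketch (our own, assuming the 4 KiB page size mentioned above) makes the arithmetic explicit:

```python
# Sketch (illustrative, not from the paper): under the VIPT constraint,
# each way may hold at most one page worth of data, so associativity
# must scale linearly with total capacity.

PAGE_SIZE = 4 * 1024  # 4 KiB

def min_vipt_ways(capacity):
    """Smallest associativity a VIPT cache of this capacity can have."""
    # way_size = capacity / ways must not exceed PAGE_SIZE,
    # so ways >= capacity / PAGE_SIZE.
    return max(1, capacity // PAGE_SIZE)

for kib in (16, 32, 64):
    print(f"{kib} KiB L1 -> at least {min_vipt_ways(kib * 1024)}-way")
# 16 KiB -> 4-way, 32 KiB -> 8-way, 64 KiB -> 16-way
```

Doubling a VIPT L1 thus doubles the number of ways that must be read and compared on every access, which is the latency and energy cost the paper targets.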
The second solution is to translate virtual addresses before the L1 is filled, thus accessing the cache purely with VAs. Such virtually-indexed virtually-tagged (VIVT) designs eliminate the translation latency for accessing the cache. However, relying purely on VAs for most memory accesses (as most are L1 hits) presents significant complications for cache management and coherence, because software maps multiple VAs to the same physical address (synonyms) and may also map the same virtual address to multiple physical addresses (homonyms). Prior work has developed solutions, but the designs are more complicated than VIPT [2], [3]. Many current processors have adopted the simple and reliable VIPT design including, to the best of our knowledge, all Intel x86 processors, IBM Power processors, and recent implementations from ARM.

In this focused paper, we propose and evaluate a new option for cache indexing that relaxes the size vs. associativity tradeoff of VIPT caches while retaining their advantages. Our speculatively-indexed physically-tagged (SIPT) mechanism uses simple predictors to accurately predict the cache index bits beyond the offset bits and uses PAs for tag matches. Thus, larger, lower-associativity caches can be designed with just 1–3 bits being predicted. If SIPT predicts an incorrect cache index, the physical tag mismatches and the cache is accessed again with the correct bits from the now-available PA. We show that misspeculation rates are low and that performance and energy are improved.

The predictability of the few index bits may depend on the mapping between virtual and physical memory. However, we