Appears in the Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), 2003

Power Efficient Comparators for Long Arguments in Superscalar Processors

Dmitry Ponomarev, Gurhan Kucuk, Oguz Ergin, Kanad Ghose
Department of Computer Science, State University of New York, Binghamton, NY 13902-6000
e-mail: {dima, gurhan, oguz, ghose}@cs.binghamton.edu
http://www.cs.binghamton.edu/~lowpower

ABSTRACT

Traditional pull-down comparators that are used to implement associative addressing logic in superscalar microprocessors dissipate energy on a mismatch in any bit position of the comparands. As mismatches occur much more frequently than matches in many situations, such circuits are extremely energy-inefficient. In recognition of this inefficiency, a series of dissipate-on-match comparator designs have been proposed to address this power concern. These designs, however, are limited to arguments at most 8 bits long. In this paper, we examine designs of energy-efficient comparators capable of comparing arguments up to 32 bits long. Such long comparands are routinely used in load-store queues, caches, BTBs and TLBs. We use actual layout data and realistic bit patterns for the comparands (obtained from simulated execution of the SPEC 2000 benchmarks) to show the energy impact of using the new comparators. In general, a non-trivial combination of traditional and dissipate-on-match 8-bit comparator blocks represents the most energy-efficient and fastest solution. As an example of this general approach, we show how fast and energy-efficient comparators can be designed for comparing addresses within the load-store queue of a superscalar processor.
Categories and Subject Descriptors
B.7.2 [Integrated Circuits]: Design Aids - layout, simulation

General Terms
Design, Measurement

Keywords
Low-power comparators, superscalar datapath

ISLPED'03, August 25-27, 2003, Seoul, Korea.
Copyright 2003 ACM 1-58113-682-X/03/0008...$5.00.

1. INTRODUCTION

Today's superscalar microprocessors make extensive use of associative matching logic and comparators to support out-of-order execution and virtual memory mechanisms. Comparators, either explicit or embedded into content-addressable logic, are in pervasive use within the issue queues, load-store queues, translation lookaside buffers (TLBs), branch target buffers (BTBs), caches, reorder buffers and CAM-based register alias tables. In particular, long comparators (comparing upwards of 8 bits) are in wide use in today's high-performance processor designs. They are used, either by themselves or embedded into content-addressable logic, in at least the following key datapath components:

1) Within the translation lookaside buffer (TLB), to quickly translate virtual page numbers to physical page numbers in parallel with the access of a physical address-tagged cache.

2) Within the load-store queue (LSQ), to match the addresses of pending load instructions against the addresses of previously dispatched store instructions; a load is allowed to bypass previously dispatched stores only if its address matches the address of none of those stores.
3) Within the banks of the instruction and data caches of some embedded CPUs, such as the StrongARM SA-1100 (which uses a fully-associative 256-entry cache bank).

4) Within the branch target buffer (BTB), to obtain the target of a taken branch and continue fetching instructions along the predicted path without interruption.

The traditional equality comparator circuit used for implementing associative logic in modern datapaths (or, for that matter, any digital comparison) is shown in Figure 1 [4]. These so-called pull-down comparators pull down a precharged output, out, on a mismatch in any bit position when the evaluation signal (eval) goes high. The precharged output remains high on a match. Energy is thus dissipated on a mismatch between the compared arguments (comparands). No dynamic energy is dissipated on a full match; the only energy dissipation in that case is due to leakage.

Content-addressable memories (CAMs) also employ traditional dissipate-on-mismatch comparators, embedded into the bitcells. Recent work has addressed the problem of minimizing energy dissipation in CAMs [6, 7]. In [7], the CAM words are effectively sub-banked and searches proceed subbank by subbank. If a word-slice within a subbank does not match the corresponding bits of the search key, comparisons on the slices of the same word in the following subbanks are disabled, thereby saving the energy of those extraneous comparisons. The approach of [6] extends this technique further. The common feature of both approaches is their continued reliance on dissipate-on-mismatch comparators.
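The asymmetry between the two comparator styles can be illustrated with a first-order behavioral sketch. This is not a circuit model: the function names and the one-energy-unit-per-dissipation-event accounting are our illustrative assumptions. It simply counts how often each style spends dynamic energy when a 32-bit key is compared against a stream of tags, where (as in the LSQ or a TLB) the overwhelming majority of comparisons are mismatches.

```python
import random

WIDTH = 32  # comparand width in bits

def traditional_dissipates(tag: int, key: int) -> bool:
    # Pull-down comparator: the precharged output is discharged
    # (dynamic energy is spent) whenever ANY bit position mismatches.
    return tag != key

def dissipate_on_match(tag: int, key: int) -> bool:
    # Dissipate-on-match comparator: dynamic energy is spent only
    # when the comparands fully match.
    return tag == key

# Random 32-bit tags: matches are vanishingly rare, so the traditional
# comparator dissipates on essentially every comparison while the
# dissipate-on-match style almost never does.
random.seed(1)
key = random.getrandbits(WIDTH)
tags = [random.getrandbits(WIDTH) for _ in range(10_000)]

trad = sum(traditional_dissipates(t, key) for t in tags)
dom = sum(dissipate_on_match(t, key) for t in tags)
```

Under this event-count model, `trad` approaches the total number of comparisons while `dom` stays near zero, which is the motivation for the dissipate-on-match designs discussed above.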
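The subbank-by-subbank search of [7] can likewise be sketched behaviorally. For illustration we assume 8-bit slices searched from the low-order byte upward; the actual slice width, ordering, and per-slice energy of [7] may differ, and the one-unit-per-slice-comparison cost is again our assumption.

```python
def subbanked_search(words, key, slice_bits=8, width=32):
    """Progressive CAM search: compare one slice per stage; a word that
    mismatches in an earlier slice is disabled in all later stages,
    saving the energy of those comparisons."""
    n_slices = width // slice_bits
    mask = (1 << slice_bits) - 1
    active = list(range(len(words)))  # word indices still in contention
    energy = 0                        # one unit per slice comparison
    for s in range(n_slices):
        shift = s * slice_bits
        survivors = []
        for i in active:
            energy += 1  # this slice comparison dissipates
            if (words[i] >> shift) & mask == (key >> shift) & mask:
                survivors.append(i)
        active = survivors  # mismatching words drop out of later stages
    return active, energy
```

For example, searching three 32-bit words for one matching key costs 6 slice comparisons here, versus 12 if every slice of every word were compared; the savings grow with the fraction of words eliminated in the first subbank.

```python
matches, energy = subbanked_search(
    [0x11223344, 0x11223345, 0xAABBCCDD], 0x11223344)
# matches == [0]; energy == 6, versus 3 words * 4 slices = 12 unbanked
```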