In the Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), 2005

Instruction Packing: Reducing Power and Delay of the Dynamic Scheduling Logic

Joseph J. Sharkey, Dmitry V. Ponomarev, Kanad Ghose
Department of Computer Science
State University of New York
Binghamton, NY 13902-6000
{jsharke, dima, ghose}@cs.binghamton.edu

Oguz Ergin*
Intel Barcelona Research Center
Intel Labs, UPC
Barcelona, Spain
oguzx.ergin@intel.com

ABSTRACT
The instruction scheduling logic used in modern superscalar microprocessors often relies on associative searching of the issue queue entries to dynamically wake up instructions for execution. Traditional designs use one issue queue entry for each instruction, regardless of the actual number of operands actively used in the wakeup process. In this paper we propose Instruction Packing, a novel microarchitectural technique that reduces both the delay and the power consumption of the issue queue by sharing the associative part of an issue queue entry between two instructions, each with at most one non-ready register source operand at the time of dispatch. Our results show that Instruction Packing provides a 39% reduction in the power of the whole issue queue and a 21.6% reduction in the wakeup delay, with as little as 0.4% IPC degradation on average across the simulated SPEC benchmarks.

Categories and Subject Descriptors
C.1.1 [Computer Systems Organization]: Processor Architecture – Pipeline processors

General Terms: Design, Performance, Measurement

Keywords: Issue Queue, Low Power, Instruction Packing

1. INTRODUCTION
Contemporary superscalar microarchitectures rely on aggressive out-of-order instruction scheduling mechanisms to maximize performance across a wide variety of applications.
Instructions are dispatched into the issue queue (IQ) in order, after undergoing register renaming and noting the availability of their register sources, and they are issued for execution out of order as their source operands become available. The instruction scheduling logic operates in two phases: instruction wakeup and instruction selection. During wakeup, the destination tags of the already scheduled instructions are broadcast across the IQ, and each IQ entry associatively compares its locally stored source register tags against the broadcast destination tag. On a match, the corresponding source is marked as ready. When all sources of an instruction become ready, the instruction wakes up and becomes eligible for selection. The selection logic then selects W out of possibly N eligible instructions, where N is the size of the IQ and W is the processor's issue width.

In order to uncover sufficient instruction-level parallelism, the IQs have to be generously sized, resulting in significant power and energy dissipation in the course of accessing the queue. A large amount of energy is dissipated when the destination tags are broadcast on the tag buses, due to the charging and discharging of the wire capacitance of the tag lines themselves and the gate capacitance of the devices that implement the tag comparators. Power is also dissipated in the course of instruction dispatching (writing the instructions into the IQ), instruction selection, and instruction issuing (reading the instructions out of the queue). Several researchers have reported that the IQ accounts for 20%-25% of total chip power [17, 35].

In a traditional RISC-like processor, where each instruction can have at most two register source operands, each IQ entry has two comparators, which allow the instruction to track the arrival of both sources by monitoring the tag buses.
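The wakeup and selection phases described above, for a traditional entry with two comparators, can be sketched roughly as follows. This is a minimal behavioral model in Python, not the circuit-level design: the names `IQEntry`, `broadcast` and `select` are hypothetical, and the oldest-first selection policy is an assumption made only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IQEntry:
    """One traditional IQ entry: two source tags, one comparator per tag."""
    name: str
    src_tags: List[Optional[int]]      # None means "no register operand"
    ready: List[bool] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.ready:
            # An absent operand never has to be waited for.
            self.ready = [tag is None for tag in self.src_tags]

    def wakeup(self, dest_tag: int) -> None:
        # Each comparator checks its stored source tag against the broadcast tag.
        for i, tag in enumerate(self.src_tags):
            if tag is not None and tag == dest_tag:
                self.ready[i] = True

    def eligible(self) -> bool:
        # The instruction wakes up once all of its sources are ready.
        return all(self.ready)


def broadcast(queue: List[IQEntry], dest_tag: int) -> None:
    """Wakeup phase: drive one destination tag across every entry."""
    for entry in queue:
        entry.wakeup(dest_tag)


def select(queue: List[IQEntry], width: int) -> List[IQEntry]:
    """Selection phase: issue at most `width` eligible entries.

    Oldest-first order is an assumed policy, not taken from the paper.
    """
    chosen = [e for e in queue if e.eligible()][:width]
    for entry in chosen:
        queue.remove(entry)
    return chosen
```

For example, with a queue holding `IQEntry("add", [1, 2])`, `IQEntry("ld", [3, None])` and `IQEntry("sub", [1, None])`, broadcasting tag 1 makes only `sub` eligible, since `add` still waits on tag 2 and `ld` on tag 3. Note that `ld` and `sub` each leave one of their two comparators idle.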
In general, however, such a design results in grossly inefficient use of the comparators, for two reasons: 1) many instructions have only one source register operand, and therefore do not require two tags (and two comparators) in the first place, and 2) of the instructions with two register source operands, a large percentage have at least one of these operands ready at the time of dispatch, again rendering the second comparator unnecessary. Our simulations showed that, on average across the SPEC 2000 benchmarks, about 83% of the dynamic instructions enter the scheduling window with at least one of their source operands ready, as shown in Figure 1.

In this paper, we propose to optimize the use of the CAM logic (comparators) within the IQ by opportunistically packing two instructions into the same IQ entry, effectively duplicating some of the RAM storage for these instructions (destination register addresses, literals, opcodes) and sharing the existing comparators. When an instruction that has at least one of its two source registers ready at the time of dispatch is placed in a traditional IQ entry with two comparators, one of these comparators (the one corresponding to the ready source) is not used. By placing two such instructions within the same IQ entry, this waste can be reduced and in many cases eliminated entirely. Instruction packing reduces the number of IQ entries, and thus both the length of and the capacitive loading on the tag buses and the bitlines. Consequently, significant energy reduction and faster operation of the wakeup logic are achieved (as the queue now has significantly fewer entries), with negligible degradation in the IPC and just a slight increase in the delay of the selection logic.
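The packing idea can be illustrated with a small dispatch model. This is only a sketch under stated assumptions: the names `Instr`, `PackedEntry`, `halves_needed` and `dispatch` are hypothetical, the first-fit allocation policy is an assumption, and issue and entry deallocation are not modeled. An instruction with at most one non-ready source at dispatch takes half of an entry (one comparator); an instruction with two non-ready sources takes a whole entry.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Instr:
    name: str
    src_tags: Tuple[Optional[int], Optional[int]]  # None means "no operand"
    src_ready: Tuple[bool, bool]                   # readiness at dispatch

    def pending_tags(self) -> List[int]:
        # Tags this instruction must still watch for on the tag buses.
        return [t for t, r in zip(self.src_tags, self.src_ready)
                if t is not None and not r]


def halves_needed(instr: Instr) -> int:
    # Two non-ready sources need both comparators; otherwise one half suffices.
    return 2 if len(instr.pending_tags()) == 2 else 1


@dataclass
class PackedEntry:
    """An entry with two comparators, shared by at most two instructions."""
    occupants: List[Instr] = field(default_factory=list)

    def free_halves(self) -> int:
        return 2 - sum(halves_needed(i) for i in self.occupants)


def dispatch(queue: List[PackedEntry], instr: Instr, max_entries: int) -> bool:
    """Place `instr` first-fit (assumed policy); return False if the IQ is full."""
    need = halves_needed(instr)
    for entry in queue:
        if entry.free_halves() >= need:
            entry.occupants.append(instr)
            return True
    if len(queue) < max_entries:
        queue.append(PackedEntry([instr]))
        return True
    return False
```

Dispatching four instructions, of which only one has two non-ready sources, occupies three entries in this model instead of the four that a one-instruction-per-entry queue would need.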
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

* This work was done while Oguz Ergin was at SUNY Binghamton.

ISLPED'05, August 8–10, 2005, San Diego, California, USA.
Copyright 2005 ACM 1-59593-137-6/05/0008…$5.00.