On Graceful Degradation of Chip Multiprocessors in Presence of Faults via Flexible Pooling of Critical Execution Units

Rance Rodrigues and Sandip Kundu
Department of Electrical and Computer Engineering
University of Massachusetts at Amherst
{rodrigues, kundu}@ecs.umass.edu

Abstract

Reliability and manufacturability have emerged as -billion transistor chips. In this paper, we investigate how to degrade a chip multiprocessor (CMP) gracefully in the presence of faults, keeping its architected functionality intact at the expense of some loss of performance. The proposed solution involves sharing critical execution resources among cores to survive faults. Recent research has suggested that large datapath units such as the FPU and integer division units are good candidates for execution outsourcing to other working cores in a CMP. In this paper, we focus on the relatively small but critically important integer ALU. Outsourcing ALU operations incurs a large performance penalty, so better solutions are needed to ensure survivability with minimal performance loss. We propose provisioning a shared ALU among a set of cores that can act as a spare for any constituent core in the group. This solution works well for single ALU failures, but leads to resource contention when multiple ALUs fail. Simulation case studies on the MediaBench and MiBench benchmarks show that the proposed solution allows the CMP to remain functionally intact with no performance penalty for single ALU failures and no more than 1.5% performance loss on average for the failure of a single ALU in each core.

Keywords- Reliability; fault tolerance; dynamic hardware sharing; critical instruction execution unit; performance impact.

I. INTRODUCTION

Recent advances in manufacturing technology have permitted the integration of an unprecedented number of transistors on a chip. However, the advances in integration have been partly offset by increased susceptibility to defects [1] [2] [12].
Previous fault tolerance solutions for microprocessors are expensive in terms of area, power, or performance. This motivates new solutions that are efficient in all three.

Apart from manufacturing defects, reliability is also a rising concern for microprocessors [1] [6] [10] [12]. Reliability defects are latent during manufacturing test, but manifest over time. This motivates the need for a defect-tolerant strategy that can adapt itself in the field.

Defects result in errors. When errors are detected, a correction scheme must be in place to disable the defective areas of a chip. This presumes the existence of logic in the chip that can make up for the lost functionality. Traditionally, explicit redundancies have been added to make such recovery possible [13][15][25]. In this paper, we present a solution for a modern superscalar processor in a chip multiprocessor configuration, which inherently features various forms of redundancy. Thus, purely from a functional perspective, defect avoidance may be possible without any extra hardware. However, this may result in a loss of performance. We are also interested in preserving the performance of the processor as much as possible, provided the added area/performance cost is insignificant. This is the subject of this paper.

Our solution is based on a layered approach. A microprocessor consists of various structure classes, such as large arrays, small arrays, datapath, and control logic. Previous publications address defect recovery in many of these structure classes [13][15][18][25]. Large arrays such as memories and caches are protected using ECC [14] or by redundancy. Array structures such as the ROB, TLB, and issue queues may be protected by spare rows [15] or other techniques [25]. Similarly, control logic may be protected by adding simple checker cores to superscalar cores [1]. The datapath consists of integer (INT)/FP dividers, multipliers, and ALUs, and is responsible for the execution of instructions.
For faults in the larger of these datapath units, such as the multipliers, dividers, and FP ALU, an outsourcing scheme was shown to be sufficient for chip multiprocessors [13]. The success of this outsourcing strategy is predicated on the observations that such instructions occur infrequently and have higher execution latencies. Overheads of just 10% are reported. Unfortunately, the same scheme cannot be used to protect the INT ALUs, as these units are used far more frequently and have lower execution latency [5]. Hence, outsourcing results in large performance penalties.

We substantiate this claim with experiments using the SESC architectural simulator [19] on the MediaBench [20] and MiBench [21] benchmarks. We measured the performance drop as the execution latency of the INT ALU units was increased progressively from 1 through 20 cycles. The results are averaged over all considered workloads (see Section V for the workloads considered) and are plotted in Fig. 1. It may be seen that even 2-3 cycles of additional latency in computation for these instructions results in a 20-25% loss of performance. This experiment shows that the INT ALU units are the most performance critical of all the datapath units. Hence, outsourcing is not a feasible solution. One possible solution is to add gross redundancy. We offer a better alternative in this paper.

Fig. 1. Effect of increased INT ALU execution latency on overall performance. (x-axis: number of cycles to execute INT ALU instructions, 1 to 20; y-axis: percentage performance with respect to single-cycle ALU.)

978-1-4577-1056-8/11/$26.00 © 2011 IEEE. 2011 IEEE 17th International On-Line Testing Symposium
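
The reasoning above, that outsourcing is tolerable for infrequent, long-latency operations but costly for frequent INT ALU operations, can be illustrated with a back-of-the-envelope model. The following is a minimal sketch under stated assumptions and is not the SESC-based methodology used in this paper: it assumes a simple additive-stall CPI model, and the function name, the `exposed` parameter (the share of added latency not hidden by out-of-order execution), and all numeric inputs are hypothetical values chosen only for illustration.

```python
def relative_performance(base_cpi, op_fraction, extra_cycles, exposed=1.0):
    """Performance relative to the fault-free core under a toy additive-stall model.

    base_cpi     -- cycles per instruction of the fault-free core (assumed)
    op_fraction  -- fraction of dynamic instructions of the affected class
    extra_cycles -- added latency per affected instruction
    exposed      -- share of the added latency that is not hidden (0..1, assumed)
    """
    if base_cpi <= 0:
        raise ValueError("base_cpi must be positive")
    if not 0.0 <= op_fraction <= 1.0 or not 0.0 <= exposed <= 1.0:
        raise ValueError("op_fraction and exposed must lie in [0, 1]")
    if extra_cycles < 0:
        raise ValueError("extra_cycles must be non-negative")
    # Each affected instruction contributes its exposed extra latency to the CPI.
    stall_cpi = op_fraction * extra_cycles * exposed
    return base_cpi / (base_cpi + stall_cpi)


# Hypothetical inputs, not measurements from this paper.
frequent_short = relative_performance(1.0, op_fraction=0.5, extra_cycles=2, exposed=0.3)
rare_long = relative_performance(1.0, op_fraction=0.01, extra_cycles=10, exposed=0.3)

print(f"frequent class, +2 cycles:  {frequent_short:.3f}")
print(f"rare class,     +10 cycles: {rare_long:.3f}")
```

In this toy model the penalty scales with the product of instruction frequency and exposed added latency, so a frequently used unit suffers more from a small latency increase than a rarely used unit does from a much larger one. Actual sensitivity depends on dependence chains and issue width, which only a detailed simulation captures.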