Detecting and Correcting Errors in Functional Units Performing Composable Operations

Lou Scheffer
Cadence

ABSTRACT
In the operation of a deep-submicron (DSM) chip, transient errors are possible. This paper proposes a new way to detect and/or correct such errors. If we must do N identical composable operations, we can detect errors by doing 1 additional similar operation, and both detect and correct errors by performing about log2(N) additional operations. For example, suppose an algorithm requires performing 1000 FFTs. With one additional FFT, we can verify that all FFTs were performed correctly. With 10 additional FFTs (performed on various linear combinations of the input data) we can detect which, if any, FFT was wrong, and compute the correct answer without re-doing the incorrect computation. This result holds whether the results are computed in one cycle or many, sequentially or in parallel, in hardware or in software.

Categories and Subject Descriptors
J.6 [Computer Applications]: Computer Aided Engineering

General Terms
Algorithms, Performance, Design, Verification

Keywords
XXX XXX, XXX vXXX, SXXX tXXX, YXXX

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICCAD 2003 November 9-13, 2003, San Jose, California, USA. Copyright 2003 ACM 1-58113-526-2/02/0012 ...$5.00.

1. INTRODUCTION AND MOTIVATION
In the operation of DSM chips, there is the possibility of transient errors. These are most commonly caused by cosmic rays, alpha particles, or neutrons that impinge upon the chip and cause transient data errors or upset the state of one or more flip-flops. This is commonly called Single Event Upset, or SEU. Not surprisingly, this problem is most common in systems exposed to radiation (such as space-based systems) but occurs (more rarely) even at ground level [6]. As dimensions scale down this problem will become worse, since smaller and smaller disturbances can cause upsets, and some sources (such as neutrons) cannot be eliminated by any practical amount of shielding.

We would like a way to detect, and if possible correct, such errors. Many such methods have been proposed, but they require significant overhead. In this paper we propose a cheaper method of performing such detection and correction.

We start by observing that if we perform N identical operations on different pieces of data, we may be able to detect an error by performing the equivalent of a checksum. The critical property is composability, which allows us to check the results of N operations by doing one additional operation. The classic composable operation is a linear one, where F(a + b) = F(a) + F(b). This implies F(a + b + c) = F(a) + F(b) + F(c), and so on. Thus one additional F() operation can be used to check the results of any number of F() operations. More generally, we require two (possibly identical) operations ⊕ and ⊗ such that F(a ⊕ b) = F(a) ⊗ F(b). An example of a non-linear but composable function is exp(), since exp(a + b) = exp(a) · exp(b).

Are enough operations composable to make this approach worthwhile? The answer depends on the application, of course, but in many cases it appears to be yes. Composable operations include any linear operation and a significant number of other mathematical operations. The linear operations alone cover such common operations as FFTs, DCTs, wavelet transforms, and many matrix operations. Speech encoding and adaptive optics are dominated by FFTs, a linear operation. Video encoding is full of DCTs (discrete cosine transforms).
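The linear case described above, F(a + b) = F(a) + F(b), can be illustrated with a minimal sketch. This is not the paper's implementation: the names dft and checksum_ok, the naive O(n^2) DFT standing in for an FFT, the sample data, and the tolerance tol are all assumptions chosen for illustration. The sketch checks N results with one extra transform by comparing F(a_1 + ... + a_N) against F(a_1) + ... + F(a_N).

```python
import cmath

def dft(x):
    """Naive O(n^2) discrete Fourier transform; a stand-in for any linear F."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def checksum_ok(inputs, outputs, f=dft, tol=1e-9):
    """Check N results of a linear f using one extra call to f.

    Relies on linearity: f(a_1 + ... + a_N) == f(a_1) + ... + f(a_N).
    tol is an assumed floating-point tolerance, not a value from the paper.
    """
    summed_input = [sum(col) for col in zip(*inputs)]   # a_1 + ... + a_N
    expected = f(summed_input)                          # the one extra operation
    actual = [sum(col) for col in zip(*outputs)]        # f(a_1) + ... + f(a_N)
    return all(abs(e - a) <= tol for e, a in zip(expected, actual))

# Hypothetical data: three input vectors and their transforms.
inputs = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.5, 0.0], [4.0, 0.0, -2.0, 1.0]]
outputs = [dft(a) for a in inputs]
clean = checksum_ok(inputs, outputs)

# Simulate a transient error in one element of one result.
outputs[1][2] += 0.25
corrupted = checksum_ok(inputs, outputs)
```

Here clean evaluates to True and corrupted to False. Because the comparison is done in floating point it needs a tolerance, so an error smaller than tol would go unnoticed in this sketch.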
MPEG-2 spends up to 35% of its time in DCTs [9, 7]. Decomposition into wavelets is now common in many encodings. Many multimedia operations apply a given digital filter to many samples. All of these are linear operations, and hence composable.

1.1 Previous work
It is well known¹ that almost all commonly used Boolean codes can be extended to work when the symbols are real or complex numbers instead of binary digits [5]. This work extends this idea by replacing the channel with arbitrarily complex operations, provided they are composable.

¹ Among a small circle of specialists, that is.

Another closely related work is on Algorithm Based Fault Tolerance (ABFT), introduced by Huang and Abraham [4]. As the name implies, this is intended to protect the execution of a single algorithm by adding "checksums", usually linear functions of the input. These checksum values are