Static Typing for a Faulty Lambda Calculus
David Walker, Lester Mackey, Jay Ligatti, George A. Reis, David I. August
Department of Computer Science, Princeton University
{dpw,lmackey,jligatti,gareis,august}@princeton.edu
Abstract
A transient hardware fault occurs when an energetic particle strikes a transistor, causing it to change state. These faults do not cause permanent damage, but may result in incorrect program execution by altering signal transfers or stored values. While the likelihood that such transient faults will cause any significant damage may seem remote, over the last several years transient faults have caused costly failures in high-end machines at America Online, eBay, and the Los Alamos Neutron Science Center, among others [6, 44, 15]. Because susceptibility to transient faults is proportional to the size and density of transistors, the problem of transient faults will become increasingly important in the coming decades.
This paper defines the first formal, type-theoretic framework for studying reliable computation in the presence of transient faults. More specifically, it defines λzap, a lambda calculus that exhibits intermittent data faults. In order to detect and recover from these faults, λzap programs replicate intermediate computations and use majority voting, thereby modeling software-based fault tolerance techniques studied extensively, but informally [10, 20, 30, 31, 32, 33, 41]. To ensure that programs maintain the proper invariants and use λzap primitives correctly, the paper defines a type system for the language. This type system guarantees that well-typed programs can tolerate any single data fault. To demonstrate that λzap can serve as an idealized typed intermediate language, we define a type-preserving translation from a standard simply-typed lambda calculus into λzap.
Categories and Subject Descriptors D.3.1 [Programming languages]: Formal Definitions and Theory—Semantics; B.8.1 [Hardware]: Reliability, Testing, and Fault-Tolerance
General Terms Languages, Reliability, Theory, Verification
Keywords Transient hardware faults, soft faults, type systems, typed intermediate languages, lambda calculus, fault tolerance, reliable computing
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICFP’06 September 16–21, 2006, Portland, Oregon, USA. Copyright © 2006 ACM 1-59593-309-3/06/0009...$5.00.
1. Transient Faults and Trustworthy Computing
In recent decades, microprocessor performance has been increasing exponentially, due in large part to smaller and faster transistors enabled by improved fabrication technology. While such transistors yield performance enhancements, their lower threshold voltages and tighter noise margins make them less reliable [5, 23, 37]. Processors that use these transistors are more susceptible to transient faults (also known as soft faults), which result from external events, such as energetic particles striking the chip. These faults do not cause permanent damage, but may result in incorrect program execution by altering signal transfers or stored values. As each processor generation increases the density of transistors, the effects of transient faults will become more pronounced. To mitigate the deleterious effects of processor strikes, processor designers are devoting more of their attention to the growing reliability problem.
While discussions of alpha particles, neutrons, and cosmic rays interfering with earthly transistors may sound like science fiction to those unfamiliar with state-of-the-art processor design, it absolutely is not; transient faults are already causing substantial failures with significant costs in high-end machines. Consider, for instance, the following well-documented failures:
• In 2000, Sun Microsystems acknowledged that cosmic rays interfered with cache memories and caused crashes in server systems at major customer sites, including America Online, eBay, and dozens of others [6].
• Cypress Semiconductor acknowledged, “the wake-up call came in the end of 2001 with a major customer reporting havoc at a large telephone company. Technically, it was found that a single soft fail... was causing an interleaved system farm to crash.” [44]
• Cypress Semiconductor also states: “Another incident occurred at an automotive supplier, where their billion-dollar factory ground to a halt every month due to what was traced to a single-bit flip in their network” [44]. (The emphasis is our own.)
• At the Los Alamos Neutron Science Center, Hewlett Packard acknowledged their AlphaServer ES45 supercomputer was frequently crashing due to transient faults [15].
Hence, reliability in the presence of transient faults is already a significant cause for concern. Moreover, in the next 10 to 20 years, a desire to keep Moore’s law on track will continue to provide huge incentives to reduce transistor sizes even further, substantially increasing the threat of transient faults.
The case for software-implemented fault tolerance. Processor designers must constantly make trade-offs to obtain the best performance while still meeting their constraints. With the increasing importance of transient fault tolerance, reliability will emerge as another critical axis that can be traded off against performance, power, and cost.
However, reliability, like security, can be a more difficult sell to the general public. The number of GHz your newest processor has, the lifetime of your laptop battery, and the cost of your computing solution all attract more attention. This is particularly true since hardware manufacturers are generally loath to pub-