Long-Term Thermal Overstressing of Computers

Kirk A. Gray, Accelerated Reliability Solutions
Michael Pecht, University of Maryland, College Park

SINCE THE EARLY days of solid-state electronics, reliability engineers have been taught that the dominant cause of hardware unreliability is component failure and that component reliability can be as much as doubled for each 10 °C reduction in temperature. This belief was a fundamental tenet of US Military Handbook 217 (Mil-Hdbk-217), the first document on reliability prediction of electronic components [1]. The last version of Mil-Hdbk-217 was revision F, published in 1995. Since then, the document has been effectively discontinued. Nevertheless, although its predictions are inaccurate and misleading, the document continues to play an influential role in the field of reliability engineering, and a few companies still use the results [2].

Although there has been no empirical data to support this belief, the concept has persisted and made its way into other commonly used reliability prediction handbooks. Furthermore, these prediction methods have relied on the analysis of insufficient failure data collected from the field, and vendors and engineers have assumed that the system components have inherent constant failure rates derived from the collected data. It has been further assumed that such constant failure rates could be tailored by independent "modifiers" to account for variations in manufacturing quality, operating conditions, and temperature.

In the 1990s, after a host of studies conducted by the National Institute of Standards and Technology [2], the US Army [3], AT&T [4], and others [5], it became clear that the approach propagated by these handbooks had been damaging to the electronics industry and that a change was needed.
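To make the criticized assumptions concrete, the sketch below works through the arithmetic of the two handbook-style ideas described above: the "doubling per 10 °C" rule of thumb and a constant failure rate scaled by independent modifiers. It illustrates the traditional approach only, not a method the authors endorse. The function names, the modifier labels, and all numeric values are hypothetical and are not taken from Mil-Hdbk-217 or from this article.

```python
from typing import Mapping


def rule_of_thumb_factor(delta_t_c: float, doubling_step_c: float = 10.0) -> float:
    """Upper-bound reliability multiplier claimed by the rule of thumb.

    delta_t_c is the temperature reduction in degrees Celsius. The rule says
    reliability can be "as much as doubled" per 10 degree step, so this is the
    most optimistic value the rule allows, not a measured result.
    """
    return 2.0 ** (delta_t_c / doubling_step_c)


def handbook_failure_rate(base_rate: float, modifiers: Mapping[str, float]) -> float:
    """Constant base failure rate scaled by independent multiplicative modifiers.

    This mirrors the assumption that each modifier (quality, environment,
    temperature, ...) acts independently of the others.
    """
    rate = base_rate
    for factor in modifiers.values():
        rate *= factor
    return rate


if __name__ == "__main__":
    # A 30 degree reduction would be credited with up to 2**3 = 8x reliability.
    print(rule_of_thumb_factor(30.0))

    # Hypothetical base rate and modifier values, chosen only for illustration.
    example_modifiers = {"quality": 2.0, "environment": 4.0, "temperature": 1.5}
    print(handbook_failure_rate(0.5, example_modifiers))
```

Under the rule of thumb, every extra degree of cooling appears to buy reliability, and the multiplicative form lets the temperature modifier be tuned independently of everything else. These are the two properties that the studies cited above call into question.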
Today, the consensus is that these methods and this type of approach should never be used because they are inaccurate for predicting actual field failures; moreover, they provide highly misleading predictions, which can result in poor designs and poor logistics decisions [6]. Although most of these handbooks have been discontinued and are no longer used by the US military, a few manufacturers of electronic components, printed wiring and circuit boards, and electronic equipment and systems still subscribe to the traditional reliability prediction techniques (e.g., Mil-Hdbk-217 and its progeny) in some manner, although sometimes unknowingly.

Because of the use of predictions based on Mil-Hdbk-217, cooling solutions to reduce the temperature of components have been considered the most effective way to increase reliability [7,8]. Component manufacturers' data sheets provide absolute maximum ratings (AMRs) and recommended operating conditions (ROCs). AMRs are provided as the limits within which a part can be operated reliably, even though it might not meet electrical-function specifications. For example, Motorola states that between the recommended and AMR operation limits a given part might not meet electrical specifications, but that physical failure or adverse effects on reliability are not expected. Yet system designers are still adding thermal-mechanical complexity, and possibly reducing system reliability, by attempting to cool components below the AMR threshold level in the continuing belief that doing so will improve reliability. In fact, a Boeing study showed that using cooling solutions to drive down system temperatures adds cost and complexity with little

Thermal Overstressing

Significant opportunities exist to reduce costs in the design, manufacture, and operation of systems by using temperatures higher than specified in testing systems' reliability.
The authors share the findings and observations of an experimental study in which they subjected operating computers to high steady-state temperatures and thermal cycling well beyond their design specifications. The results suggest that significant cost savings can be realized without compromising reliability.

0740-7475/11/$26.00 © 2011 IEEE. Copublished by the IEEE CS and the IEEE CASS. IEEE Design & Test of Computers, 58