QUALITY AND RELIABILITY ENGINEERING INTERNATIONAL
Qual. Reliab. Engng. Int. 14: 331–337 (1998)

AN INVESTIGATION OF ‘CANNOT DUPLICATE’ FAILURES

R. WILLIAMS¹, J. BANNER¹, I. KNOWLES¹, M. DUBE², M. NATISHAN² AND M. PECHT²

¹ Ministry of Defence (Procurement Executive), Abbey Wood, Bristol BS34 8JH, UK
² CALCE-EPRC, University of Maryland, College Park, MD 20742, USA

SUMMARY

Various terms such as ‘cannot duplicate (CND)’, ‘re-test OK (RTOK)’, ‘no fault indicated (NFI)’, ‘no fault found (NFF)’ and ‘no trouble found (NTF)’ are used to describe the inability to replicate field failures during laboratory assessment. This paper uses CND to refer to all such failures. CND failures can make up more than 85% of all observed field failures in avionics and account for more than 90% of all maintenance costs. These statistics can be attributed to a limited understanding of root cause failure characteristics of complex systems, inappropriate means of diagnosing the condition of the system, and the inability to duplicate the field conditions in the laboratory.

This paper addresses CND issues with reference to research carried out on samples of an electronics board used as the seat-back processor module on board the Boeing 777. The boards were monitored continuously using existing on-board comprehensive built-in test equipment. It was found that the hot temperature operating limits of the board decreased by up to 70 °C during highly accelerated environmental stress. Furthermore, improperly seated connectors were found to result in spurious component failure reports from the built-in test equipment. This paper suggests that the observed drift in operating limit and connector issues are two likely root causes of CND failures and makes recommendations for addressing them.

Crown Copyright 1998. Reproduced with the permission of the Controller of Her Majesty’s Stationery Office.
KEY WORDS: ‘cannot duplicate’ failure; no fault found (NFF)

Correspondence to: I. Knowles, Ministry of Defence (Procurement Executive), Abbey Wood, Bristol BS34 8JH, UK.

INTRODUCTION

Failures in complex electronic systems can be extremely difficult to isolate and identify [1]. An electronic system that is observed to fail in the field often functions correctly during subsequent fault-finding activities. Various terms such as ‘cannot duplicate (CND)’, ‘re-test OK (RTOK)’, ‘no fault indicated (NFI)’, ‘no fault found (NFF)’ and ‘no trouble found (NTF)’ are used to describe the phenomenon. This paper uses CND to refer to all such failures.

Typical causes of CND failures include transient failures due to α-particle radiation and power supply fluctuations [2], intermittently occurring faults due to loose connections, partially defective or deteriorating components and poor hardware design. CND failures also arise owing to the inability of the laboratory environment to duplicate the field load conditions exactly and owing to the self-healing of failures such as solder joint cracks during testing. In certain cases the occurrence of CND indicates the use of inappropriate diagnostic procedures [1] or that the total fault spectrum is larger than fault coverage [3].

The problems associated with transients and intermittent failures have been recognized since the 1960s [4]. Since then, investigations have been conducted into addressing CND issues [5,6], improved on-line monitoring techniques [5,7], modelling of intermittent failures [8–10] and design of fault-tolerant systems with built-in redundancy [2].

However, CND remains the major source of malfunction in digital systems [11,12]. Such failures have been observed in diverse applications such as trucking [13] and avionics [14,15], where their proportion can be as high as 85%. They account for more than 90% of the total maintenance expense [16,17]. As the cause of failure is unknown, repair is impossible.
Consequently, fully functional units are often replaced, leading to ineffective cost management [13]. Worse, maintenance practices that permit a potentially faulty unit to be returned to the field can create a safety hazard.

This paper discusses the phenomenon of CND failures in electronic systems based upon the findings from highly accelerated life-cycle testing (HALT) carried out at the QualMark Corporation on the seat-back processor module for the Boeing 777. The board included a built-in test (BIT) which was used for continuous on-line monitoring of the board and diagnosis of failures during environmental stress testing. The findings are discussed in terms of equipment operating limits, the nature of ‘soft’ failures, the performance of the BIT and the observed effects on these of accumulated damage. Finally, the implications of these findings are addressed with respect to providing a road-map to deal with CND failures in the repair process.

Received 21 November 1997; Revised 19 March 1998

DEFINITIONS

Electronic equipment is generally designed to specifications which include the range or limits of environmental and operating stresses such as temperature, humidity and vibration. This range is called the specification limit.

The stress margin which is designed into the equipment, such that the equipment will function correctly beyond the specification limit, is called the operating limit for the equipment. Outside the operating limit the equipment may show failures due to a shift in performance characteristics (e.g. slew rate, voltage thresholds,