Optimal structure of multi-state systems with multi-fault coverage

Rui Peng a,*, Huadong Mo b, Min Xie b, Gregory Levitin c

a Dongling School of Economics and Management, University of Science and Technology Beijing, Beijing 100083, China
b Department of Systems Engineering and Engineering Management, City University of Hong Kong, Hong Kong
c The Israel Electric Corporation Ltd., Haifa, Israel

Article info

Article history:
Received 12 October 2012
Received in revised form 19 April 2013
Accepted 3 May 2013
Available online 16 May 2013

Keywords: Genetic algorithm; Imperfect fault coverage; Reliability; Multi-fault coverage; Fault level coverage; Universal generating function

Abstract

Due to imperfect fault coverage, the reliability of redundant systems cannot be improved without limit by adding redundancy. It is therefore essential to study the optimal structure of redundant systems. This paper considers a multi-state series-parallel system with two types of parallelization: redundancy and work sharing. Unlike existing works, which consider single-fault coverage, multi-fault coverage is considered here in order to accommodate a wider range of fault tolerant mechanisms. Under multi-fault coverage, the coverage factor of an element failure in a work sharing group depends on the status of the other elements. It is assumed that an uncovered failure of an element belonging to a group of elements sharing the same task can cause failure of the entire group. The optimal trade-off between the two kinds of parallelization is studied for various settings of the fault coverage factor. Examples of data transmission systems and task processing systems are presented to illustrate the application of the results.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

Fault tolerance is widely used to enhance system reliability, especially for systems with stringent reliability requirements, such as nuclear power controllers and flight control systems [12,8,22,20].
However, as the fault and error handling mechanisms (detection, location, and isolation) can themselves fail, some failures can remain undetected or uncovered, which can lead to total failure of the entire system or its sub-systems [17,26,13]. Examples of this effect of uncovered faults can be found in computing systems, electrical power distribution networks, phased mission systems, etc. [5,27,24]. The probability of successfully covering a fault (avoiding fault propagation), given that the fault has occurred, is known as the coverage factor [4,1,2].

Due to the existence of different fault covering mechanisms, different coverage models have been studied in the literature [25,18,19]. Among these, the element level coverage (ELC) model and the fault level coverage (FLC) model are the most important and the most widely studied. For ELC, the coverage probability of each system component is independent of the status of the other components. ELC is typical for systems containing a built-in test (BIT) capability, where the selection among the redundant elements is made on the basis of a self-diagnostic capability of the individual elements. For FLC, the coverage probability of a system element depends on the number of failed elements. In other words, the selection among redundant elements varies between initial and subsequent failures. In the HARP terminology [3], ELC models are known as single-fault models, whereas FLC models are known as multi-fault models. Multi-fault models are able to represent a wide range of fault tolerant mechanisms. An example is majority voting among the elements currently known to be working; see Myers and Rauzy [18].

Due to imperfect fault coverage, system reliability can decrease when redundancy is increased beyond a certain limit [11,17]. As a result, system structure optimization problems arise. Some of these problems have been formulated and solved for parallel systems and k-out-of-n systems [1,2].
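The effect described above, where reliability first rises and then falls as redundancy grows, can be illustrated with a minimal sketch. It is not the method of this paper: it assumes a simple binary parallel system of n identical, statistically independent elements, each working with probability p, in which the system survives only if at least one element works and every failure is covered. The function names, the parameter values, and the simplified closed forms are assumptions made for illustration only.

```python
from math import comb, prod


def elc_parallel_reliability(n, p, c):
    """Reliability of n identical parallel elements under ELC (assumed model).

    Each failed element is covered independently with probability c.
    The system works if at least one element works and no failure is uncovered.
    """
    q = 1.0 - p
    return (p + c * q) ** n - (c * q) ** n


def flc_parallel_reliability(n, p, coverage):
    """Reliability under FLC (assumed model).

    coverage[i] is the coverage factor of the (i+1)-th failure, so it may
    depend on how many elements have already failed. Needs at least n-1 entries.
    """
    q = 1.0 - p
    total = 0.0
    for k in range(n):  # k failed elements, at least one survivor
        all_covered = prod(coverage[:k])  # empty product is 1 when k == 0
        total += comb(n, k) * p ** (n - k) * q ** k * all_covered
    return total


def best_redundancy(p, c, n_max):
    """Number of parallel elements (1..n_max) maximizing the ELC reliability."""
    return max(range(1, n_max + 1),
               key=lambda n: elc_parallel_reliability(n, p, c))


if __name__ == "__main__":
    p, c = 0.9, 0.95  # hypothetical element reliability and coverage factor
    for n in range(1, 9):
        print(n, round(elc_parallel_reliability(n, p, c), 6))
    print("best n:", best_redundancy(p, c, 8))
```

With the hypothetical values p = 0.9 and c = 0.95, the ELC reliability in this sketch peaks at three elements (about 0.9842) and declines for larger n, since each added element also adds a chance of an uncovered failure. When every entry of `coverage` equals the same constant, the FLC sum reduces to the ELC closed form.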
Reliability Engineering and System Safety 119 (2013) 18–25
http://dx.doi.org/10.1016/j.ress.2013.05.007

Abbreviations: BIT, built-in test; ELC, element level coverage; FLC, fault level coverage; MSS, multi-state system; pmf, probability mass function; GA, genetic algorithm; WSG, work sharing group (group of elements affected by uncovered failures); UGF, universal generating function.
* Corresponding author. Tel.: +86 1305 154 0519.
E-mail addresses: pengrui1988@gmail.com, ruipeng@mail.ustc.edu.cn (R. Peng).

Levitin [10] presents a model of series-parallel multi-state systems (MSS) with two types of task parallelization: parallel task execution with work sharing, and redundant task execution. A framework for finding the balance between the two kinds of parallelization that maximizes system reliability is proposed there, under the assumption that ELC applies in each work sharing group. Given the variety of fault handling mechanisms used in practice, the ELC model alone cannot cover all cases. Though Levitin and Amari [11] proposed a way to evaluate the reliability of MSS considering FLC, the system structure optimization problem was