Experimental Assessment of Fault Coverage for Fault-Tolerant High-Performance Processors

Meng-Ju Shih, Yung-Yuan Chen and Gene Eu Jan+

Department of Computer Science and Information Engineering, Chung-Hua University, Hsin-Chu, Taiwan
E-mail: chenyy@chu.edu.tw

+Department of Computer Science and Information Engineering, National Taipei University, Taipei County, Taiwan
E-mail: gejan@mail.ntpu.edu.tw

Abstract: In this paper, we present a comprehensive experimental assessment of fault coverage for a fault-tolerant VLIW processor, which comprises error detection, error rollback recovery and reconfiguration mechanisms. We implement the proposed fault-tolerant VLIW design in VHDL and employ fault injection to investigate the effects of fault duration, workload variation and the number of recovery attempts allowed on the relevant design metrics, such as performance degradation, error detection/recovery coverage, and the fail-safe and fail-unsafe probabilities.

Keywords: Error detection, error rollback recovery, fault coverage, fault injection, fault-tolerant VLIW processor.

I. INTRODUCTION

Intelligent systems, such as intelligent car driving systems or intelligent robots, require stringent dependability while they are in operation. Recent experimental data show that the rate of radiation-induced soft errors increases rapidly, especially in combinational logic, as chip fabrication enters very deep submicron technology [1, 2]. This trend raises an urgent need to incorporate fault tolerance into high-performance microprocessors and systems so as to meet their dependability requirements. Recently, the reliability of high-end processors has been receiving more and more attention [3-7].

The previous literature lacks a complete and comprehensive fault-tolerant framework for VLIW processors that spans error detection, error rollback recovery and reconfiguration.
Prior work also lacks a detailed analysis of the error-detection latency for transient faults and of the hardware overhead. Finally, and importantly, it lacks an effective measurement to validate the proposed fault-tolerant approaches. The measurement of fault-tolerant systems covers the performance degradation, the fault coverage, and the fail-safe and fail-unsafe probabilities. The analysis also needs to characterize the effects of the fault duration, the workload variations, and the number of times rollback recovery is allowed on the fault coverage and on other design metrics of interest, such as the probability that reconfiguration occurs and the recovery-induced performance overhead. This work addresses the issues stated above.

The paper is organized as follows. In Section 2, we propose a fault-tolerant approach concentrating on the dependable data path design of VLIW processors. The hardware architecture and performance analysis are presented in Section 3. The experimental results and discussion are given in Section 4. Section 5 concludes the paper.

II. FAULT-TOLERANT DATA PATH DESIGN

Nowadays, the VLIW processor is a major architectural approach to high-performance computing systems [8]. Typical examples of VLIW processors are the Intel and HP IA-64 [9], the TI TMS320C62x/67x DSP devices [10] and the Fujitsu FR500. In this study, we therefore focus on reliable data path design for VLIW processors. We develop a comprehensive fault-tolerant framework for high-performance VLIW processors, which consists of error detection, rollback error recovery and reconfiguration. Error detection is based on the concept of instruction duplication. Rollback error recovery exploits a checkpointing scheme to overcome transient faults; if a fault cannot be recovered during the recovery process, reconfiguration is activated to isolate the failed component. From then on, the system operates in a degraded mode.
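The interplay of detection, rollback and reconfiguration described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the VHDL design evaluated in this paper: the register state is a Python dictionary, the checkpoint is a copy of that dictionary, and the names `execute_with_rollback`, `good_alu`, `make_transient_alu` and `max_retries` are hypothetical. A transient fault is modeled as a module that returns a wrong result for a fixed number of calls.

```python
from typing import Callable, Dict

# Hypothetical model of one functional module: takes two operands, returns a result.
Alu = Callable[[int, int], int]


def good_alu(a: int, b: int) -> int:
    """Fault-free adder."""
    return a + b


def make_transient_alu(bad_calls: int) -> Alu:
    """Adder that returns a corrupted result for its first `bad_calls` calls.

    Assumption for illustration: a transient fault flips the least
    significant bit of the result while the fault is active.
    """
    remaining = [bad_calls]

    def alu(a: int, b: int) -> int:
        if remaining[0] > 0:
            remaining[0] -= 1
            return (a + b) ^ 1
        return a + b

    return alu


def execute_with_rollback(
    regs: Dict[str, int],
    dest: str,
    src_a: str,
    src_b: str,
    primary: Alu,
    shadow: Alu,
    max_retries: int,
) -> str:
    """Execute one instruction twice and compare the two results.

    Returns "ok" if the first comparison matches, "recovered" if a retry
    after rollback matches, and "reconfigure" if every allowed retry
    still mismatches (the point at which a failed module would be isolated).
    """
    checkpoint = dict(regs)  # checkpoint taken before the instruction
    for attempt in range(max_retries + 1):
        a, b = regs[src_a], regs[src_b]
        first = primary(a, b)   # original instruction
        second = shadow(a, b)   # duplicated instruction
        if first == second:
            regs[dest] = first  # commit only a verified result
            return "ok" if attempt == 0 else "recovered"
        # Mismatch detected: restore the checkpoint and try again.
        regs.clear()
        regs.update(checkpoint)
    return "reconfigure"


if __name__ == "__main__":
    regs = {"r1": 2, "r2": 3, "r3": 0}
    faulty = make_transient_alu(bad_calls=1)
    print(execute_with_rollback(regs, "r3", "r1", "r2", faulty, good_alu, 2))
    print(regs["r3"])
```

In this model a fault that outlasts `max_retries` re-executions leaves the destination register unchanged and reports that reconfiguration is needed; which module is then isolated, and how, is outside the scope of the sketch.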
A VLIW processor core may contain several different kinds of functional modules in its data paths, such as integer ALUs and load/store units. A number of identical modules are provided for each functional type. We assume that the register file is protected by an error-correcting code. In the following, for simplicity of presentation, we use three identical ALU modules to demonstrate our fault-tolerant approach. Note that the scheme presented here can easily be extended to a generic VLIW core whose data paths have more than one functional type.

Fig. 1 shows the state diagram of reliable data path operation, where S0 is the normal state (all three modules are good); S1 is the reconfiguration state (one failed module has been isolated); S2_1 is recovery before reconfiguration; S2_2 is recovery after reconfiguration; and S3 is the fail-safe state. The fault-tolerant methodology