Experimental Assessment of COTS DBMS Robustness under Transient Faults

Diamantino Costa †‡
CISUC–Centro de Informática e Sistemas da UC
University of Coimbra, P-3030 Portugal
dino@dei.uc.pt

Henrique Madeira
Departamento de Engenharia Informática
University of Coimbra, P-3030 Portugal
henrique@dei.uc.pt

† Research supported in part by Fundação para a Ciência e Tecnologia - PRAXIS XXI under grant number BD/5636/95.
‡ On leave from Critical Software, www.criticalsoftware.com.

Abstract

This paper evaluates the behavior of a common off-the-shelf (COTS) database management system (DBMS) in the presence of transient faults. Database applications have traditionally been a field with fault-tolerance needs, concerning both data integrity and availability. While most commercially available DBMS provide support for data recovery and fault tolerance, very little is known about the impact of transient faults on a COTS database system. In this experimental study, a strictly off-the-shelf target system is used (an Oracle 7.3 server running on a Wintel platform), combined with a TPC-A based workload and a software-implemented fault injection tool, XceptionNT. We found that a non-negligible share of the injected faults (13%) led to the database server hanging or terminating prematurely. However, the results also show that COTS DBMS products behave reasonably well with respect to data integrity: none of the injected faults affected end-user data.

1. Introduction

Database applications have traditionally been a field with fault-tolerance needs, concerning both data integrity and availability. Furthermore, several mechanisms needed to achieve fault tolerance, such as transactions, checkpointing, log files, and replica control management, have been developed or improved specifically for databases.
The database research community recognizes the benefits of experimental fault injection for validating, evaluating, and fine-tuning those techniques [1].

Most commercially available database management systems (DBMS) provide support for data recovery and fault tolerance, even when the underlying hardware platforms do not have any fault-tolerance features. However, few works in the literature address the evaluation or validation of those fault-tolerance techniques [2][3][4] and, to the best of our knowledge, no one has yet addressed the impact of transient faults on a common off-the-shelf (COTS) database system.

Most of the dependability-enforcing solutions available in commodity DBMS address permanent faults and assume a fail-stop model for the system. Little attention has been paid to transient hardware faults. While it is true that software and operations are becoming major sources of service disruption, transient hardware faults still account for a non-negligible share of computer system failures. Such faults become even more relevant for COTS systems with no hardening features or specific built-in fault-tolerance mechanisms. The increasing use of COTS technology (DBMS included) in mission-critical and business-critical systems further raises the interest in evaluating the dependability of such systems. Will COTS savings mean less robustness? And if so, how much less?

This paper gives an insight into the behavior, in the presence of transient faults, of a DBMS that was taken strictly off the shelf. The target system consists of an Oracle 7.3 server running on the Windows NT 4.0 operating system on an Intel-based P6 hardware platform, loaded over the network by "clients" running a TPC-A-like benchmark [5].

One clear difficulty in evaluating the impact of faults on a COTS database is the huge complexity of existing DBMS.
Software-implemented fault injection (SWIFI) techniques are probably the best alternative for this evaluation. However, injecting transient faults into a complex DBMS is a difficult task, especially for a COTS system whose source code is not available. The goal of injecting low-level faults that directly emulate transient hardware faults (and indirectly induce erroneous system behavior similar to that induced by software errors) requires minimal intrusion from the SWIFI tool. The fault injection tool should not disturb the transaction timing, as the TPC benchmark used in the experiments emulates a typical interactive database application (which excludes running the server in trace