Reliability and Performance of Component Based Software Systems with Restarts, Retries, Reboots and Repairs

Vibhu Saujanya Sharma
Dept. of Computer Science and Engineering, Indian Institute of Technology Kanpur, Kanpur, UP, INDIA 208016
vsharma@cse.iitk.ac.in

Kishor S. Trivedi
Dept. of Electrical and Computer Engineering, Duke University, Durham, NC 27708-0291, USA
kst@ee.duke.edu

Abstract

High reliability and performance are vital for software systems handling diverse mission critical applications. Such software systems are usually component based and may possess multiple levels of fault recovery. A number of parameters, including the software architecture, the behavior of individual components, the underlying hardware, and the fault recovery measures, affect the behavior of such systems, and there is a need for an approach to study them. In this paper we present an integrated approach for modeling and analysis of component based systems with multiple levels of failures and fault recovery at both the software and the hardware level. The approach is useful for analyzing attributes such as overall reliability, performance, and machine availabilities for such systems, wherein failures may happen at the software components, the operating system, or the hardware, and corresponding restarts, retries, reboots or repairs are used for mitigation. Our approach encompasses Markov chain and queueing network modeling for estimating system reliability, machine availabilities and performance. The approach is helpful for designing and building better systems and for improving existing systems.

1 Introduction

Software systems these days are used in diverse fields and handle many mission and time critical jobs. It is important for such systems to be highly reliable and responsive.
As these systems are mostly component based, important attributes like reliability and performance depend on the characteristics of the individual components, the way they interact with each other, and the underlying hardware infrastructure on which the components are deployed. Moreover, as failures can happen at the software components as well as the hardware, the way in which these failures are resolved also has a direct bearing on the overall reliability and performance.

Failures at software components are usually resolved by rebooting their respective machines and restarting the system. However, this adversely affects performance and also makes the system unavailable. Recent empirical studies [2, 3] show that successfully restarting just the software components (as opposed to rebooting the machines) is an effective way to handle transient software failures and increase system reliability, while simultaneously reducing the performance overhead. Other levels of fault recovery can also be present [27], and these affect the performance as well as the reliability of the system.

As the overall behavior of such complex component based systems depends on a number of different factors, modeling and analyzing such systems for attributes like reliability and performance has become important to ensure their efficient and sound operation. If such an analysis can be performed early in the software life-cycle, it can facilitate key decisions regarding the software design so that the final product performs better. This activity is equally important for existing systems, to help improve them.

In general, questions such as the following become pertinent when studying such systems:
How does the system perform if one or more software components are unreliable?
How does unreliable underlying hardware affect the system, and where should it be improved?
How do multiple fault recovery measures such as restarts, retries, reboots and repairs affect the system reliability and performance?
What are the various tradeoffs that exist?

Answering such questions requires an approach that takes into account the software architecture and deployment

17th International Symposium on Software Reliability Engineering (ISSRE'06) 0-7695-2684-5/06 $20.00 © 2006