Chameleon: Software Infrastructure for Adaptive Fault Tolerance

S. Bagchi, K. Whisnant, Z. Kalbarczyk, R.K. Iyer
Center for Reliable and High-Performance Computing
University of Illinois at Urbana-Champaign
1308 W. Main St., Urbana, IL 61801
E-mail: [bagchi, kwhisnan, kalbar, iyer]@crhc.uiuc.edu

Abstract

This paper presents Chameleon, an adaptive infrastructure that allows different levels of availability requirements to be supported simultaneously in a networked environment. Chameleon provides dependability through the use of special ARMORs (Adaptive, Reconfigurable, and Mobile Objects for Reliability) that control all operations in the Chameleon environment. Three broad classes of ARMORs are defined. Managers oversee other ARMORs and recover from failures in their subordinates. Daemons provide communication gateways to the ARMORs at the host node and make a host’s resources available to the Chameleon environment. Common ARMORs implement specific techniques for providing application-required dependability. Above all else, Chameleon provides a flexible architecture through which adaptive fault tolerance may be achieved in an unreliable and heterogeneous network. Key concepts used to accomplish this goal include the automated creation of new ARMORs, the automatic extension of existing ARMORs, the seamless integration of existing and new ARMORs into existing execution strategies, and the creation of new fault-tolerant execution strategies. To our knowledge, Chameleon is one of very few real implementations that maintain fault tolerance through a software infrastructure alone. Chameleon provides fault tolerance for the application, and the software infrastructure itself is also fault-tolerant. To demonstrate the capabilities of the Chameleon environment, we have implemented a prototype infrastructure that provides a set of ARMORs to initialize the environment and to support the dual and triple modular redundancy (TMR) application execution modes.
Through this testbed environment, we measure the execution overhead and the time to recover from failures in the user application, the Chameleon ARMORs, the hardware, or the operating system.

Keywords: adaptive fault tolerance, highly available networked computing, software-implemented fault tolerance, COTS, extendible modular architecture