Quintet, Tools for Reliable Enterprise Computing

Werner Vogels, Dan Dumitriu, Mike Pantiz, Kevin Chipawolski, Jason Pettis
Department of Computer Science, Cornell University †
vogels@cs.cornell.edu

† Quintet is part of the research performed by the Reliable Distributed Computing Group at Cornell, and is supported by DARPA/ONR under contract N0014-96-1-10014 and by Intel Corporation and Microsoft Corporation.

Abstract

This paper describes Quintet, a system for developing and managing reliable enterprise servers. Quintet provides tools for the distribution and replication of server components to achieve guaranteed availability and performance. It is targeted to serve the application tier in multi-tier business systems, with components constructed using Microsoft COM. Quintet takes a radically different approach from previous systems that support object replication: replication and distribution are no longer transparent, but are brought under the full control of the developer.

1 Introduction

In corporate settings, general enterprise computing systems are increasingly organized as distributed systems. These systems are critical to corporate operation, and a strong need arises to make them highly reliable. Industry has taken the first step in addressing this need: based on experience with dedicated cluster environments, new cluster management software has been developed that targets off-the-shelf enterprise server systems. In general, commercial cluster products provide functionality for migrating applications from failed nodes to surviving nodes in the system. Although this offers some relief for systems such as web servers, databases or electronic mail processors, it does not facilitate the development of systems that are capable of exploiting the cluster environment to its full potential.
A new research project at the Reliable Distributed Systems group at Cornell addresses the problems of building reliable enterprise systems. The project, dubbed Quintet, focuses on development and runtime support for the components that make up the application tier of multi-tier business systems. In our target systems this layer is constructed out of servers built as collections of COM components. Components developed using the tools provided by Quintet are able to guarantee reliable operation in a number of ways, and the system is extensible in that new mechanisms and interfaces can be added.

The project is concerned with research in two areas. The first investigates what kinds of development tools are needed to build reliable distributed components for enterprise computing, with a focus on efficiency, simplicity and ease of use. The second concentrates on the infrastructure needed for reliability management on the high-performance cluster systems that provide the component runtime environment.

This paper first provides some background on the way Quintet views the issues surrounding reliability and distribution transparency, which is necessary to understand the design choices that have been made. The following section provides an overview of Quintet's functionality and the solutions that can be built with the Quintet tools. After a description of the target environment and the relation between Quintet and MTS, the paper describes in detail the major components that make up Quintet.

2 Reliability

Component reliability in Quintet addresses two aspects of distributed computing: high availability and scalable performance. The first guarantees that, given a bound on the number of node failures, the remaining set of nodes continues to provide the required functionality. The second ensures that the system, using adaptive methods, distributes the load over the available resources to guarantee optimal performance.
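The two reliability aspects can be made concrete with a small sketch. It is an illustration only, not Quintet's interface: the function names `survives_failures` and `least_loaded`, the node names, and the component names are all hypothetical, and the load-balancing rule (pick the node with the lowest reported load) is a simple stand-in for the adaptive methods mentioned above.

```python
from itertools import combinations


def survives_failures(placement, required, max_failures):
    """Check the availability property for a given replica placement.

    placement:    dict mapping node name -> set of component names it hosts
    required:     set of component names that must stay available
    max_failures: bound on the number of simultaneous node failures

    Returns True if, for every possible set of `max_failures` failed nodes,
    each required component still has a replica on a surviving node.
    Checking exactly `max_failures` failures is enough, because fewer
    failures always leave a superset of the surviving nodes.
    """
    nodes = list(placement)
    k = min(max_failures, len(nodes))
    for failed in combinations(nodes, k):
        surviving = set(nodes) - set(failed)
        # Union of everything hosted on the surviving nodes.
        available = set().union(*(placement[n] for n in surviving))
        if not required <= available:
            return False
    return True


def least_loaded(load_by_node):
    """Pick the node with the lowest reported load (hypothetical policy)."""
    return min(load_by_node, key=load_by_node.get)


# Hypothetical three-node cluster hosting two components.
placement = {
    "node-a": {"Orders"},
    "node-b": {"Orders", "Billing"},
    "node-c": {"Billing"},
}
one_failure_ok = survives_failures(placement, {"Orders", "Billing"}, 1)
two_failures_ok = survives_failures(placement, {"Orders", "Billing"}, 2)
target = least_loaded({"node-a": 0.7, "node-b": 0.2, "node-c": 0.5})
```

In this placement every component has two replicas on distinct nodes, so a single failure is tolerated, while two simultaneous failures can remove both replicas of one component.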
Reliability in Quintet is described using a Quality of Service specification. When a new component is added to the system, the administrator describes the reliability requirements of the component, which serve as input to the runtime system and to the component class factories. The specification can be changed on-line, and the system can be requested to reconfigure accordingly. The most obvious approach for providing high availability and scalable performance is to replicate components over several server nodes and to provide client fail-over and load balancing to achieve the