Tolerating Byzantine Faulty Clients in a Quorum System

Barbara Liskov
MIT CSAIL
Cambridge, MA, USA

Rodrigo Rodrigues
INESC-ID / Instituto Superior Técnico
Lisbon, Portugal

Abstract

Byzantine quorum systems have been proposed that work properly even when up to f replicas fail arbitrarily. However, these systems are not so successful when confronted with Byzantine faulty clients. This paper presents novel protocols that provide atomic semantics despite Byzantine clients. Our protocols prevent Byzantine clients from interfering with good clients: bad clients cannot prevent good clients from completing reads and writes, and they cannot cause good clients to see inconsistencies. In addition, we prevent bad clients that have been removed from operation from leaving behind more than a bounded number of writes that could be done on their behalf by a colluder.

Our protocols are designed to work in an asynchronous system like the Internet and they are highly efficient. We require 3f + 1 replicas, and either two or three phases to do writes; reads normally complete in one phase and require no more than two phases, no matter what the bad clients are doing.

We also present strong correctness conditions for systems with Byzantine clients that limit what can be done on behalf of bad clients once they leave the system. Furthermore, we prove that our protocols are both safe (they meet those conditions) and live.

1 Introduction

Quorum systems [4, 13] are valuable tools for building highly available replicated data services. A quorum system can be defined as a set of sets (called quorums) with certain intersection properties. These systems allow read and write operations to be performed only at a quorum of the servers, since the intersection properties ensure that any read operation will have access to the most recent value that was written. The original work on quorum systems assumed that servers fail benignly, i.e., by crashing or omitting some steps.
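
The intersection property can be illustrated with a small sketch. It assumes threshold quorums, i.e., any q of the n replicas form a quorum; the function names are hypothetical, and the Byzantine quorum size of 2f + 1 out of 3f + 1 replicas is an assumption made here for illustration rather than something stated in the text so far. The sketch enumerates all quorums and reports the smallest overlap between any two of them.

```python
from itertools import combinations


def min_quorum_overlap(n: int, q: int) -> int:
    """Smallest intersection between any two quorums of size q drawn from n replicas."""
    quorums = [frozenset(c) for c in combinations(range(n), q)]
    # With a single quorum (q == n) there are no pairs to compare;
    # that quorum trivially overlaps itself in q replicas.
    return min((len(a & b) for a, b in combinations(quorums, 2)), default=q)


def overlap_has_correct_replica(f: int) -> bool:
    """Check that any two quorums share at least f + 1 replicas.

    Assumed sizes (illustrative): n = 3f + 1 replicas, quorums of 2f + 1.
    An overlap of f + 1 contains at least one non-faulty replica
    when at most f replicas are faulty.
    """
    n, q = 3 * f + 1, 2 * f + 1
    return min_quorum_overlap(n, q) >= f + 1


if __name__ == "__main__":
    # Benign (crash) setting: majority quorums of 2 out of 3 replicas overlap in 1.
    print(min_quorum_overlap(3, 2))
    # Byzantine setting under the assumed sizes, for f = 1 and f = 2.
    print(overlap_has_correct_replica(1), overlap_has_correct_replica(2))
```

Under these assumptions, two majority quorums always share at least one replica, which suffices when replicas only crash, while the larger overlap of f + 1 is what lets a reader find at least one correct replica holding the latest write when up to f replicas may lie.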
More recently, researchers have developed techniques that enable quorum systems to provide data availability even in the presence of arbitrary (Byzantine) faults [9]. Earlier work provides correct semantics despite server (i.e., replica) failures and also handles some of the problems of Byzantine clients [2, 3, 5, 9, 10, 11, 12].

This paper extends this earlier work in two important ways. First, it defines new protocols that efficiently handle more problems caused by Byzantine clients than previous approaches. Our protocols compare favorably to all previous proposals: they either rely on weaker assumptions (e.g., about the network), or they are more efficient in terms of operation latency and number of replicas.

Second, the paper formally defines novel correctness conditions for Byzantine quorum systems, and proves that our protocols meet such conditions. The correctness conditions are stronger than what has been stated previously [11] and what has been guaranteed by previous approaches.

Since a dishonest client can write garbage into the shared variable, it may seem there is little value in limiting what bad clients can do. But this is not the case, for two reasons. First, bad clients can cause a protocol to misbehave so that good clients are unable to perform operations (i.e., the protocol is no longer live) or observe incorrect behavior. For example, if the variable is write-once, a good client might observe that its state changes multiple times.

Second, bad clients can continue to interfere with good ones even after they have been removed from operation, e.g., by a system administrator who learns of the misbehavior. We would like to limit such interference so that, after only a limited number of writes by good clients, any lurking writes left behind by a bad client will no longer be visible to good clients.
A lurking write is a modification launched by the bad client before it was removed from operation that will become visible (possibly with help from an accomplice) after it has left the system. By limiting such writes we can ensure that the object becomes useful again after the departure of the bad client, e.g., some invariant that good clients preserve will hold.

Of course, it is not possible to prevent actions by a bad client, even if it has been shut down, if the private key it uses to prove that it is authorized to modify an object can be used by other nodes; thus, we consider a bad client to be in the system as long as any node knows its private key. (In practice this problem might be handled by an administrator
Proceedings of the 26th IEEE International Conference on Distributed Computing Systems (ICDCS'06) 0-7695-2540-7/06 $20.00 © 2006 IEEE