Scalable Verification of Distributed Systems Implementations via Messaging Abstraction *

Can Arda Muftuoglu†, Habib Saissi†, Péter Bokor†‡, and Neeraj Suri
Technische Universität Darmstadt
{arda, habib, pbokor, suri}@cs.tu-darmstadt.de

1 Approach Overview

Motivation. A number of verification approaches attempt to prove liveness and safety properties of distributed system models, and to tackle the more complex target of ensuring these properties in actual implementations. A typical approach is to verify a simplified model of the system and argue, mostly informally, that the result applies to the implementation, i.e., the unmodified system. Hence, the implementation itself is not verified and can contain unrevealed bugs.

One major challenge of verifying a distributed system is to capture its (global) state, which is hard, or even impossible, in general. Current approaches apply snapshot algorithms to capture slices of the system state [7]. Because this considerably complicates the design of the verifier, the complexity of distributed snapshots can be mitigated by mapping processes running on different (and physically remote) machines into processes running within a single machine [5].

Even if the global state can be captured efficiently, the verification of distributed systems suffers from state space explosion due to concurrency and possible faults. In addition, an unmodified system contains implementation details that further increase complexity. As a result, the verification of unmodified distributed systems has so far been limited to debugging [7] or to small systems [3].

Proposing messaging abstraction. We propose to decouple the operations of a distributed system into (a) system-specific process-level operations and (b) system-independent communication elements. As a result, the separated communication layer can be substituted with abstractions that are more amenable to verification.
In the context of message-passing systems, we refer to this approach as messaging abstraction. Messaging abstraction is sound if the specification of the system depends only on certain properties of message passing (e.g., FIFO or reliable channels) and not on how these properties are implemented. The implementation of these properties (e.g., via TCP connections) can then be verified separately, once and for all.

* Demo with MP-Basset [8] is available at the conference.
† Student. ‡ Presenting author.

Messaging abstraction also enables fast prototyping of new systems with different communication models. For example, the messaging abstraction can easily be changed from reliable FIFO channels to lossy channels with possible out-of-order message delivery. Such changes can be tedious with real implementations.

Results overview. In the following case study, we substantiate our approach by model checking implementations of complex state machine replication (SMR) protocols, namely Paxos [6] and Zab [4]. We implement the concept of messaging abstraction within the MP-Basset model checker [8]. Our experiments show that, thanks to messaging abstraction, MP-Basset is able to quickly disprove false properties of both protocols. In particular, our approach identified the failure of Paxos SMR to preserve the order of user operations as submitted to the state machine, and the failure of Zab to ensure liveness.

2 A Case Study

MP-Basset is a model checker for message-passing systems [8]. Because it is based on the Java Pathfinder model checker, MP-Basset allows the local program of a process to be written in Java. Messaging abstraction in MP-Basset is simply a queue of messages, where each message is a triple of the sender, the recipient, and the content of the message. Processes use two primitives for sending and receiving messages through the messaging abstraction layer. Firstly, a process can call the predefined method send(...) to send a message.
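As a minimal sketch of the abstraction described so far (not MP-Basset's actual API, which is Java-based), the following Python code models the communication layer as a queue of (sender, recipient, content) triples with a send operation. The class name MessagingAbstraction, the fifo and lossy flags, the deliverable and droppable methods, and the per-channel reading of FIFO are all assumptions made for illustration. The point it tries to show is that switching the channel model changes only which pending messages a verifier may deliver or drop next, while the process code that calls send stays the same.

```python
from collections import namedtuple

# A message is a triple (sender, recipient, content), as in the text.
Message = namedtuple("Message", ["sender", "recipient", "content"])


class MessagingAbstraction:
    """Hypothetical stand-in for the communication layer (not MP-Basset's API)."""

    def __init__(self, fifo=True, lossy=False):
        self.fifo = fifo    # assumption: FIFO is enforced per (sender, recipient) channel
        self.lossy = lossy  # assumption: a pending message may be dropped
        self.queue = []     # pending messages, kept in send order

    def send(self, sender, recipient, content):
        """Append a new message triple to the queue."""
        self.queue.append(Message(sender, recipient, content))

    def deliverable(self):
        """Return the messages the channel model allows to be delivered next."""
        if not self.fifo:
            return list(self.queue)  # out-of-order: any pending message
        seen, heads = set(), []
        for m in self.queue:
            channel = (m.sender, m.recipient)
            if channel not in seen:  # keep only the oldest message per channel
                seen.add(channel)
                heads.append(m)
        return heads

    def droppable(self):
        """Return the messages the channel model allows to be lost."""
        return list(self.queue) if self.lossy else []


def demo(fifo, lossy):
    """Send three messages and report what may be delivered or dropped next."""
    net = MessagingAbstraction(fifo=fifo, lossy=lossy)
    net.send("p1", "p2", "a")
    net.send("p1", "p2", "b")
    net.send("p3", "p2", "c")
    return [m.content for m in net.deliverable()], len(net.droppable())
```

Under these assumptions, demo(True, False) models reliable FIFO channels, where only the oldest message of each channel can be delivered, and demo(False, True) models lossy, out-of-order channels, where every pending message can be delivered or dropped.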
Secondly, if a message is available in the queue, the recipient process can specify a condition (or guard) that determines when the message is consumed. The condition depends on the content of the message and the local state of the process.

System examples: Paxos SMR & Zab. We consider two state machine replication (SMR) protocols and model check them using messaging abstraction. We choose SMR for its complexity and generality, given that SMR can be used to replicate an arbitrary service. The first SMR protocol is based on the Paxos consensus algorithm [6]. The second protocol uses the Zab atomic broadcast algorithm [4]. Zab is part of the ZooKeeper protocol suite,