Design and Development of a Fault-Tolerant Multi-Threaded Acceptor-Connector Design Pattern

Naghmeh Ivaki, Filipe Araujo, Fernando Barros
CISUC, Dept. of Informatics Engineering, University of Coimbra, Portugal
naghmeh@dei.uc.pt, filipius@uc.pt, barros@dei.uc.pt

Abstract. Fault-tolerance is vital for dependable distributed applications that can deliver service, even in the presence of faults. Over the last few decades, among all the protocols proposed to offer reliability and fault-tolerance, TCP has grown to become one of the cornerstones of the Internet. However, despite emulating reliable communication in distributed environments, TCP does not handle connection failures when connectivity is lost for some time, even if both endpoints are still running. When this occurs, developers must roll back the peers to some coherent state, often with error-prone, ad hoc, or custom application-level solutions. In this report, we refine the Acceptor-Connector design pattern to tackle the TCP unreliability problem. The pattern decouples the failure-related processing from the connection and service processing, efficiently handling different connections and their possible crashes concurrently, thereby yielding more reusable, extensible, and efficient distributed communication. The solution we propose incorporates proven multi-threaded solutions and a buffering scheme that eliminates the need for an application-layer acknowledgment scheme. This simplifies the development of reliable connection-oriented applications using the ubiquitous TCP protocol.

1 Introduction

The growing importance of the Internet in people’s lives and businesses, including e-commerce, financial services, health care, government, and entertainment, increases the need for large-scale dependable distributed applications. At the heart of most distributed applications, especially those requiring reliability, we find the Transmission Control Protocol (TCP) [1].
The popularity of TCP is unquestionable: every major operating system provides a TCP/IP communication stack with Application Programming Interfaces (APIs) for a large number of programming languages. At first glance, TCP looks like a simple and powerful solution to overcome network unreliability, which is true up to a certain point. However, if connectivity is lost for a period of time, the TCP connection breaks, making any kind of recovery very difficult for the endpoints. In many programs and protocols, such as FTP [2], SSH [3], TLS [4] or the X Window System [5], it would be worthwhile to keep the interaction alive.

Although many solutions for reliable communication over faulty channels exist in the literature, most of them focus on replication schemes that resort to additional configuration and hardware. The problem with all these solutions is that they either try to replace TCP or require special software or hardware that may not be readily available or mature enough for deployment on all platforms and in all languages. In fact, the number of solutions available demonstrates the difficulty of ensuring reliable communication using TCP.

The asynchrony and unreliability of the network combine to complicate the timely detection of message losses. For example, one peer application may send many messages, unaware that they are not reaching the peer endpoint. If the API ever returns an error notification, the application will be unable to tell which write operations did or did not get through the channel. One possible approach would be to use a session layer to buffer TCP messages and retransmit them as necessary [6]. This, however, incurs traffic and delay overheads, besides forcing the programmer to use a non-TCP API. Other middleware approaches may also serve a similar purpose, by providing extra layers over TCP, but they share the same overhead and non-TCP