Persistent 9P Sessions for Plan 9

Gorka Guardiola, paurea@gmail.com
Russ Cox, rsc@swtch.com
Eric Van Hensbergen, ericvh@gmail.com

ABSTRACT

Plan 9 [5] has traditionally run mainly on local networks, where lost connections are rare. As a result, most programs, including the kernel, do not bother to plan for their file server connections to fail. When a connection does fail, these programs must be restarted, and if the kernel's connection to the root file server fails, the machine must be rebooted. This approach works only because lost connections are rare. Across long-distance networks, where connection failures are more common, it is woefully inadequate. To address this problem, we wrote a program called recover, which proxies a 9P session on behalf of a client, redialing the remote server and reestablishing connection state as necessary, and thereby hides network failures from the client. This paper presents the design and implementation of recover, along with performance benchmarks on Plan 9 and on Linux.

1. Introduction

Plan 9 is a distributed system developed at Bell Labs [5]. Resources in Plan 9 are presented as synthetic file systems served to clients via 9P, a simple file protocol. Unlike file protocols such as NFS, 9P is stateful: servers maintain per-connection state, such as which files are open by which clients. Maintaining per-connection state allows 9P to be used for resources with sophisticated access-control policies, such as exclusive-use lock files and chat session multiplexers. It also makes servers easier to implement, since they can discard all file ids belonging to a connection once that connection is lost.

The benefits of a stateful protocol come with one important drawback: when the network connection is lost, reestablishing that state is not a trivial operation. Most 9P clients, including the Plan 9 kernel, do not plan for the loss of a file server connection.
If a program loses its connection to a file server, the connection can be remounted and the program restarted. If the kernel loses its connection to the root file server, the machine can be rebooted. These heavy-handed solutions are appropriate only when connections fail infrequently. In a large system with many connections, or in a system with wide-area network connections, connection failures must be handled more gracefully than by restarting the server, especially since restarting the server might cause other connections to break.

One approach would be to modify individual programs to handle the loss of their file servers. Where the resources have special semantics, such as exclusive-use lock files, this may be necessary to ensure that application-specific semantics and invariants are maintained. In general, however, most remote file servers serve traditional on-disk file systems. For these connections, it makes more sense to delegate the handling of connection failure to a single program than to change every client (including cat and ls).

We wrote a 9P proxy called recover to handle network connection failures and hide them from clients, so that the many programs written on the assumption that connections never fail can continue to be used without modification. Keeping the recovery logic in a single program makes it easier to debug, modify, and even extend. For example, in some cases it might make sense to try dialing a different file server when one fails.

This paper presents the design and implementation of recover, along with performance