The External Recovery Problem

Arkadiusz Danilecki, Mateusz Hołenko, Anna Kobusińska, and Piotr Zierhoffer

Institute of Computing Science, Poznań University of Technology, Poland
{adanilecki,akobusinska,mholenko}@cs.put.poznan.pl

Abstract. We consider the external recovery problem, in which a system is divided into autonomous subsystems that can be recovered only by means of logging the messages exchanged between the subsystems. The question follows: what restrictions on a subsystem's autonomy are required to make external recovery possible? We present example solutions affecting different aspects of the system's independence.

Keywords: Message logging, fault tolerance, checkpointing, distributed system.

1 Introduction

The probability of a node crash in a modern, large-scale computing system consisting of hundreds of thousands of nodes approaches certainty. One approach is to divide the system into subsystems and to isolate the effects of a crash within the subsystem where it occurred. Coordinated checkpointing can then be used within a subsystem [13], while, to prevent crash effects from spreading, the messages exchanged with processes in other subsystems can be logged in a pessimistic manner. An interesting theoretical question arises: under which conditions can a subsystem be recovered solely by logging the messages exchanged with other subsystems, that is, by what we call an external recovery?

There is an unspoken assumption that all parts of the system are under the control of one organization, that they cooperate freely, and that they expose all information necessary for recovery. These assumptions may not hold in the future, when subsystems may be more independent. Future cooperating components involved in a distributed computation may be unwilling to restrict their independence by, e.g., revealing the information commonly assumed to be available to message logging protocols.
Nevertheless, if a subsystem is to be recovered using external message logging, it cannot completely retain its independence. This observation prompted the question: what must minimally be known about a system, and what minimal restrictions must be imposed on its behavior, to make external recovery possible? This paper is a first step toward solving this puzzle, by identifying the problem and the possible trade-offs, and by presenting two example approaches

This work was supported by the Polish National Science Center under Grant No. DEC-2011/03/D/ST6/01331.

L. Lopes et al. (Eds.): Euro-Par 2014 Workshops, Part I, LNCS 8805, pp. 535–546, 2014. © Springer International Publishing Switzerland 2014