Copyright 2001. Published in the Proceedings of the 26th Annual IEEE Conference on Local Computer Networks (LCN 2001), Tampa, FL, USA, November 2001. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: +Intl. 732-562-3966.

Application-Level Survivable Software: rFTP Proof-of-Concept*

Mary Grzywa** Ajit Dharmik William Yurcik*** Larry Brumbaugh
Illinois State University
Department of Applied Computer Science
{addharm,wjyurci,ljbrumb}@ilstu.edu

Abstract

Application-level survivability, the ability to reconfigure an application to transparently maintain services when part of a system becomes unavailable, is the most flexible and comprehensive approach to supporting mission fulfillment, since it can provide assurance over all lower layers within a networked system. We have developed Resumable FTP (rFTP), an application based on RFC 959 that can resume the download of a file after the download has been interrupted by users or by lower layers (loss of connection). We present the design and experimental use of rFTP and conclude with future directions for work in application-layer survivability.

1. Introduction

In this paper we present a survivability paradigm that has proven useful in other domains (networking and operating systems). The basic concept is that, while everything possible should be attempted in the software development process to prevent application failures, prevention is not enough: application failures are unavoidable on most large systems, and thus comprehensive support is needed for application fault recovery.
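The resumption behavior described in the abstract can be illustrated with a minimal sketch. It is not the rFTP implementation, whose design is not given in this section; it only shows the general idea of restarting a transfer at the byte offset already stored locally. The names `resume_download`, `download_with_retries`, and `make_flaky_source` are hypothetical, and the "server" is simulated in memory so that the example runs without a network.

```python
import os
from typing import Callable, Iterable


def resume_download(fetch_from: Callable[[int], Iterable[bytes]], dest_path: str) -> int:
    """Append the missing tail of a file to dest_path and return its size.

    fetch_from(offset) is a hypothetical callable that yields the remote
    file's bytes starting at the given byte offset. With Python's ftplib,
    such a callable could wrap FTP.retrbinary(..., rest=offset); that
    wiring is an assumption and is not taken from the paper.
    """
    # The bytes already on disk show where the interrupted transfer stopped.
    offset = os.path.getsize(dest_path) if os.path.exists(dest_path) else 0
    # Append mode preserves the data received before the interruption.
    with open(dest_path, "ab") as out:
        for chunk in fetch_from(offset):
            out.write(chunk)
    return os.path.getsize(dest_path)


def download_with_retries(fetch_from, dest_path: str, max_attempts: int = 3) -> int:
    """Retry after a connection loss; each attempt resumes at the new offset."""
    for _ in range(max_attempts):
        try:
            return resume_download(fetch_from, dest_path)
        except ConnectionError:
            continue
    raise ConnectionError("transfer did not complete")


def make_flaky_source(data: bytes, fail_after: int, chunk_size: int = 4):
    """Simulated server that drops the connection once, after fail_after bytes."""
    state = {"failed": False}

    def fetch_from(offset: int):
        sent = 0
        for i in range(offset, len(data), chunk_size):
            if not state["failed"] and sent >= fail_after:
                state["failed"] = True
                raise ConnectionError("simulated loss of connection")
            chunk = data[i:i + chunk_size]
            sent += len(chunk)
            yield chunk

    return fetch_from
```

In this sketch the restart point is derived from the size of the partial local file, so no extra state has to survive the failure; a real client would also need to confirm that the remote file has not changed between attempts.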
Survivability is defined as the ability of a system to transparently maintain services when part of the system becomes unavailable. Application-level survivability is the ability of an application to reconfigure in order to maintain services despite failures in lower layers within a networked system.¹ The classic example of application-level survivability is the successful dual-mode cellular telephone. Dual-mode cellular telephones are configured for primary digital transmission, but when outside a digital signal coverage area they transparently switch to analog transmission, which has ubiquitous coverage.

We chose FTP as an application with generalizable characteristics with which to prototype application-level survivability: (1) FTP continues to be the most common method for transferring bulk data across networks; (2) FTP signaling uses two connections (separate control and data channels), which is similar to many emerging Internet applications that enable long-lasting sessions, including Voice-over-IP architectures (H.323, SIP) and streaming multimedia; (3) FTP is well documented [2,3]; and (4) providing survivability in terms of reconfiguration is straightforward.

* This work was supported in part by grants from John Deere Corporation and State Farm Insurance Company.
** Work done as a graduate student; now employed at State Farm Insurance. Email: mary.grzywa.LLR4@statefarm.com
*** Corresponding author; additional contact info: voice 309-827-4172; hard copy: 45 Oak Park Road, Bloomington, IL 61701 USA.
¹ Application connection robustness in the face of transport failures or redirection could be provided by an Internet session layer, but the lack of one drives this research at the application layer.

2. Related Work

Application-level restoration provides the most flexible survivability in a layered networked system, since it can recover from failures at any lower layer.
However, little attention has been paid to application-level survivability, although lower-layer restoration schemes have alluded to escalating recovery to the highest layer when all else fails.² The realization that multiple lower layers may fail even in well-designed, layer-coordinated systems is best exemplified by a recent study of the alarming number of incorrect datagrams that pass link-level CRC error detection and yet fail the transport-level checksum.³ [5]

² Many real-time applications would be better served by lower-layer restoration because of its speed, which may be hard to attain at higher layers.
³ It is estimated in [5] that between one packet in a few million and one packet in ten billion has an error that goes undetected at the link level and transport level.

Unfortunately, most software is not designed for fail-safe or fail-secure operation; an application failure will default to service termination and/or a fail-insecure state. What is needed is compartmentalization, analogous to watertight chambers in ships, to limit software faults and attackers' ability to do damage: an FTP Guard [4]. A recent widespread FTP security flaw (globbing) has only