Recovering from Distributable Thread Failures in Distributed Real-Time Java

Edward Curley, Binoy Ravindran, Jonathan Anderson, and E. Douglas Jensen‡
ECE Dept., Virginia Tech, Blacksburg, VA 24061, USA, {alias,binoy,andersoj}@vt.edu
‡The MITRE Corporation, Bedford, MA 01730, USA, jensen@mitre.org

Abstract

We consider the problem of recovering from failures of distributable threads (“threads”) in distributed real-time systems that operate under run-time uncertainties, including those on thread execution times, thread arrivals, and node failure occurrences. When a thread experiences a node failure, the result is a broken thread having an orphan. Under a termination model, the orphans must be detected and aborted, and exceptions must be delivered to the farthest, contiguous surviving thread segment for resuming thread execution. Our application/scheduling model includes the proposed distributable thread programming model for the emerging Distributed Real-Time Specification for Java (DRTSJ), together with an exception handler model. Threads are subject to time/utility function (TUF) time constraints and a utility accrual (UA) optimality criterion. A key underpinning of the TUF/UA scheduling paradigm is the notion of “best-effort,” where higher importance threads are always favored over lower importance ones, irrespective of thread urgency as specified by their time constraints. We present a thread scheduling algorithm called HUA and a thread integrity protocol called TPR. We show that HUA and TPR bound the orphan cleanup and recovery time with bounded loss of the best-effort property. Our implementation experience of HUA/TPR in the Reference Implementation of the proposed programming model for the DRTSJ demonstrates the algorithm/protocol’s effectiveness.

Index Terms

distributable thread, thread integrity, time/utility function, utility accrual scheduling, distributed real-time Java, scheduling, real-time

I. INTRODUCTION

Some distributed system applications (or portions of applications) are most naturally structured as a multiplicity of causally dependent flows of execution within and among objects, executing asynchronously and concurrently. The causal flow of execution can be a sequence, e.g., one that is caused by a series of nested, remote method invocations. It can also be caused by a series of chained publication and subscription events that arise from topical data dependencies, e.g., publication of topic A depends on subscription to topic B; B’s publication, in turn, depends on subscription to topic C, and so on.

Since partial failures are the common case rather than the exception in some distributed systems, applications typically desire the causal, multi-node execution flow abstraction to exhibit application-specific, end-to-end integrity properties, which is one of the most important raisons d’être for building distributed systems. Real-time distributed applications also require end-to-end timeliness properties for the abstraction.

An abstraction for programming multi-node sequential behaviors and for enforcing end-to-end properties is distributable threads [1], [2]. Distributable threads first appeared in the Alpha OS [2], and later in Mach 3.0 [3] (a subset) and MK7.3 [4]. They constitute the first-class programming and scheduling abstraction for multi-node sequential behaviors in Real-Time CORBA 2 [5] and are proposed for Sun’s emerging Distributed Real-Time Specification for Java (DRTSJ) [1]. In the rest of the paper, we will refer to distributable threads as threads, unless qualified.

A thread is a single logically distinct (i.e., having a globally unique identity) locus of control flow movement that extends and retracts through local and (potentially) remote objects. The objects in the distributable thread model are passive (as opposed to active objects that encapsulate one or more local threads). An object instance resides on a single computational node.
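The thread abstraction and the failure case from the abstract can be illustrated with a minimal Java sketch. It is not the DRTSJ API and not the TPR protocol: the class `DistributableThreadSketch`, its nested types, and all method names are hypothetical, and node failures are assumed to be known in advance as a set of node names. The sketch models a thread as an ordered chain of segments (one per object invocation, each on a single node) and, given the failed nodes, identifies the farthest contiguous surviving segment and the orphaned segments downstream of the break.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

/** Illustrative model only; every name here is hypothetical, not part of the DRTSJ. */
class DistributableThreadSketch {

    /** The portion of the thread executing inside one passive object on one node. */
    static final class Segment {
        final String node;
        final String method;
        Segment(String node, String method) { this.node = node; this.method = method; }
        @Override public String toString() { return method + "@" + node; }
    }

    /** Outcome of analyzing a break in the thread. */
    static final class Recovery {
        final int resumeIndex;        // farthest contiguous surviving segment, or -1 if none
        final List<Segment> orphans;  // surviving segments cut off downstream of the break
        Recovery(int resumeIndex, List<Segment> orphans) {
            this.resumeIndex = resumeIndex;
            this.orphans = orphans;
        }
    }

    private final UUID id = UUID.randomUUID();                 // globally unique identity
    private final List<Segment> segments = new ArrayList<>();  // root first, head last

    UUID id() { return id; }

    /** The thread extends into an object on a (possibly remote) node. */
    void invoke(String node, String method) { segments.add(new Segment(node, method)); }

    /** The thread retracts when the most recent invocation returns. */
    void returnFromInvocation() {
        if (!segments.isEmpty()) segments.remove(segments.size() - 1);
    }

    /** Locates the break closest to the root and classifies the remaining segments. */
    Recovery analyze(Set<String> failedNodes) {
        int firstFailed = -1;
        for (int i = 0; i < segments.size(); i++) {
            if (failedNodes.contains(segments.get(i).node)) { firstFailed = i; break; }
        }
        if (firstFailed < 0) {
            // No segment is on a failed node: the thread is intact.
            return new Recovery(segments.size() - 1, new ArrayList<Segment>());
        }
        List<Segment> orphans = new ArrayList<>();
        for (int i = firstFailed + 1; i < segments.size(); i++) {
            Segment s = segments.get(i);
            if (!failedNodes.contains(s.node)) orphans.add(s); // alive but disconnected
        }
        return new Recovery(firstFailed - 1, orphans);
    }

    public static void main(String[] args) {
        DistributableThreadSketch t = new DistributableThreadSketch();
        t.invoke("nodeA", "start");
        t.invoke("nodeB", "process");
        t.invoke("nodeC", "store");
        t.invoke("nodeD", "log");
        Recovery r = t.analyze(new HashSet<>(Arrays.asList("nodeC")));
        System.out.println("resume at segment index " + r.resumeIndex); // index 1 (nodeB)
        System.out.println("orphans: " + r.orphans);                    // [log@nodeD]
    }
}
```

In this example the thread spans four nodes and `nodeC` fails. The segment on `nodeB` is the farthest one still connected to the root, so under a termination model it is where an exception would be delivered, while the segment on `nodeD` is an orphan to be aborted. A real protocol must also detect the failure in bounded time, which this sketch does not address.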
A distributable thread enters an