Load Balancing in the Presence of Random Node Failure and Recovery

Sagar Dhakal¹, Majeed M. Hayat¹, Jorge E. Pezoa¹, Chaouki T. Abdallah¹, J. Doug Birdwell², and John Chiasson²

¹ Dept. of Electrical and Computer Engineering, University of New Mexico, Albuquerque, NM 87131-0001 USA
  {dhakal, hayat, jpezoa, chaouki}@eece.unm.edu
² Dept. of Electrical and Computer Engineering, University of Tennessee, Knoxville, TN 37996-2100 USA
  {birdwell, chiasson}@utk.edu

Abstract

In many distributed computing systems that are prone to either induced or spontaneous node failures, the number of available computing resources changes dynamically in a random fashion. A load-balancing (LB) policy for such systems should therefore be robust, in terms of workload re-allocation and effectiveness in task completion, with respect to the random absence and re-emergence of nodes as well as random delays in the transfer of workloads among nodes. In this paper, two LB policies for such computing environments are presented. The first policy takes an initial LB action to preemptively counteract the consequences of random failure and recovery of nodes. The second policy compensates for the occurrence of node failures dynamically by transferring loads only at the actual failure instants. A probabilistic model, based on the concept of regenerative processes, is presented to assess the overall performance of the system under these policies. The optimal performance of both policies is evaluated using analytical, experimental, and simulation-based results. The interplay between node-failure/recovery rates and the mean load-transfer delay is highlighted.

1. Introduction

In a distributed computing system, large workloads are divided among independent computational elements (CEs), or nodes, in an attempt to minimize the average service time per task of the entire system. Such load allocation is referred to in the literature as load balancing (LB).
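To make the failure-and-recovery dynamics concrete, the following is a minimal Monte Carlo sketch (not from the paper) of a single node alternating between "up" and "down" states with exponentially distributed sojourn times; the rates `fail_rate` and `recover_rate` are hypothetical, and the paper's actual regenerative-process model is developed later.

```python
import random

def simulate_availability(fail_rate, recover_rate, horizon, seed=1):
    """Simulate one node alternating between 'up' and 'down' states,
    with exponential sojourn times (rate fail_rate while up, rate
    recover_rate while down), and return the fraction of the time
    horizon during which the node is up (its empirical availability)."""
    rng = random.Random(seed)
    t, up_time, state_up = 0.0, 0.0, True
    while t < horizon:
        rate = fail_rate if state_up else recover_rate
        # Time spent in the current state, truncated at the horizon.
        sojourn = min(rng.expovariate(rate), horizon - t)
        if state_up:
            up_time += sojourn
        t += sojourn
        state_up = not state_up
    return up_time / horizon
```

For this two-state model, the long-run availability is recover_rate / (fail_rate + recover_rate), so a long simulation run should return a value near that ratio; this is only an illustration of how random up/down fluctuations thin out the pool of working nodes.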
In a heterogeneous computing environment, where different nodes (links) may have different processing speeds (delays), an effective LB policy must account for factors such as inhomogeneity in the nodes’ processing speeds, variability and inhomogeneity in the delays of inter-node communication, the number of available nodes in the system, and so on. Additionally, a distributed computing system may utilize dynamic sets of CEs, where nodes join and leave the system in a random fashion. An example of such a system is “SETI at Home” [1]. Such systems typically use dedicated workstations as well as dynamic resources comprising a network of non-dedicated nodes, such as a collection of desktops or portable computing devices that happen to be online and are used remotely, upon availability, to participate in the distributed computing. However, these nodes can go off-line at any time, regardless of the portion of the load assigned to them. Furthermore, the participation of any node may be interrupted either by local usage of the node by its owner or by the occurrence of physical failure or damage to the node. (The latter effect applies even to the set of dedicated nodes.) Such scenarios induce uncertainty in the number of functional nodes, whereby any node (including the dedicated nodes) may fluctuate randomly between a “failure” (or “down”) state and a “working” (or “up”) state.

Clearly, uncertainty in the number of working nodes is expected to degrade the performance of any LB policy that does not account for the above-described node-failure and node-recovery mechanism. More precisely, the distribution of the task completion time (or service time) depends on the statistics of node failure and recovery. The available literature on distributed computing in such uncertain environments primarily considers LB policies in which a node failure is addressed only after its occurrence. Checkpoint-resume or terminate-
1-4244-0054-6/06/$20.00 ©2006 IEEE