Load Balancing in the Presence of Random Node Failure and Recovery

Sagar Dhakal¹, Majeed M. Hayat¹, Jorge E. Pezoa¹, Chaouki T. Abdallah¹, J. Doug Birdwell², and John Chiasson²

¹Dept. of Electrical and Computer Engineering, University of New Mexico, Albuquerque, NM 87131-0001 USA
{dhakal, hayat, jpezoa, chaouki}@eece.unm.edu
²Dept. of Electrical and Computer Engineering, University of Tennessee, Knoxville, TN 37996-2100 USA
{birdwell, chiasson}@utk.edu
Abstract
In many distributed computing systems that are prone to either induced or spontaneous node failures, the number of available computing resources changes dynamically and randomly. A load-balancing (LB) policy for such systems should therefore be robust, in terms of workload re-allocation and effectiveness in task completion, with respect to the random absence and re-emergence of nodes as well as random delays in the transfer of workloads among nodes. In this paper, two LB policies for such computing environments are presented: the first policy takes an initial LB action to preemptively counteract the consequences of random failure and recovery of nodes; the second policy compensates for node failures dynamically by transferring loads only at the actual failure instants. A probabilistic model, based on the concept of regenerative processes, is presented to assess the overall performance of the system under these policies. The optimal performance of both policies is evaluated using analytical, experimental, and simulation-based results. The interplay between node-failure/recovery rates and the mean load-transfer delay is highlighted.
1. Introduction
In a distributed computing system, large workloads are divided among independent computational elements (CEs), or nodes, in an attempt to minimize the average service time per task of the entire system. Such load allocation is referred to in the literature as load balancing (LB). In a heterogeneous computing environment, where different nodes (links) may have different processing speeds (delays), an effective LB policy must account for factors such as inhomogeneity in the nodes' processing speeds, variability and inhomogeneity in inter-node communication delays, and the number of available nodes in the system.
Additionally, a distributed computing system may utilize dynamic sets of CEs, where nodes may join and leave the system in a random fashion; "SETI at Home" [1] is one example of such a system. Such systems typically use dedicated workstations as well as dynamic resources comprising a network of non-dedicated nodes, such as a collection of desktop or portable computing devices that are online and are used remotely, upon availability, to participate in the distributed computation. These nodes, however, can go off-line at any time, regardless of the portion of the load assigned to them. Furthermore, the participation of any node may be interrupted either by local usage of the node by its owner or by the occurrence of physical failure or damage to the node. (The latter effect applies even to the set of dedicated nodes.) Such scenarios induce uncertainty in the number of functional nodes, whereby any node (including the dedicated nodes) may randomly fluctuate between a "failure" (or "down") state and a "working" (or "up") state.
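The up/down fluctuation described above can be pictured as a two-state alternating process. As a minimal illustrative sketch (not the paper's model, which is more general), the following simulation assumes exponentially distributed up and down durations with hypothetical mean values, and estimates the long-run fraction of time a node is available:

```python
import random

def simulate_availability(mean_up, mean_down, horizon, seed=0):
    """Simulate one node alternating between 'up' and 'down' states with
    exponentially distributed holding times (an illustrative assumption).
    Returns the fraction of the horizon during which the node is up."""
    rng = random.Random(seed)
    t, up_time, state_up = 0.0, 0.0, True
    while t < horizon:
        mean = mean_up if state_up else mean_down
        dwell = rng.expovariate(1.0 / mean)
        dwell = min(dwell, horizon - t)  # truncate the final dwell period
        if state_up:
            up_time += dwell
        t += dwell
        state_up = not state_up  # toggle between working and failed
    return up_time / horizon

# For this renewal-type model, long-run availability approaches
# mean_up / (mean_up + mean_down), i.e. 0.9 for the values below.
print(simulate_availability(mean_up=9.0, mean_down=1.0, horizon=1e5))
```

Under this toy model, a policy that ignores failures effectively assumes availability equal to 1; the gap between 1 and the simulated fraction is a rough measure of the workload that must be re-allocated.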
Clearly, uncertainty in the number of working nodes is expected to degrade the performance of any LB policy that does not account for the above-described node-failure and recovery mechanism. More precisely, the distribution of the task completion time (or service time) depends on the statistics of node failure and recovery. The available literature on distributed computing in such uncertain environments primarily considers LB policies in which a node failure is addressed only
after its occurrence. Checkpoint-resume or terminate-
1-4244-0054-6/06/$20.00 ©2006 IEEE