Dynamic Load Balancing in Distributed Systems in the Presence of Delays: A Regeneration-Theory Approach

Sagar Dhakal, Majeed M. Hayat, Senior Member, IEEE, Jorge E. Pezoa, Cundong Yang, and David A. Bader, Senior Member, IEEE

Abstract: A regeneration-theory approach is undertaken to analytically characterize the average overall completion time in a distributed system. The approach considers the heterogeneity in the processing rates of the nodes as well as the randomness in the delays imposed by the communication medium. The optimal one-shot load balancing policy is developed and subsequently extended to develop an autonomous and distributed load-balancing policy that can dynamically reallocate incoming external loads at each node. This adaptive and dynamic load balancing policy is implemented and evaluated in a two-node distributed system. The performance of the proposed dynamic load-balancing policy is compared to that of static policies as well as existing dynamic load-balancing policies by considering the average completion time per task and the system processing rate in the presence of random arrivals of the external loads.

Index Terms: Renewal theory, queuing theory, distributed computing, dynamic load balancing.

1 INTRODUCTION

The computing power of any distributed system can be realized by allowing its constituent computational elements (CEs), or nodes, to work cooperatively so that large loads are allocated among them in a fair and effective manner. Any strategy for load distribution among CEs is called load balancing (LB). An effective LB policy ensures optimal use of the distributed resources whereby no CE remains in an idle state while any other CE is being utilized.

In many of today’s distributed-computing environments, the CEs are linked by a delay-limited and bandwidth-limited communication medium that inherently inflicts tangible delays on internode communications and load exchange.
Examples include distributed systems over wireless local-area networks (WLANs) as well as clusters of geographically distant CEs connected over the Internet, such as PlanetLab [1]. Although the majority of LB policies developed heretofore take account of such time delays [2], [3], [4], [5], [6], they are predicated on the assumption that delays are deterministic. In actuality, delays are random in such communication media, especially in the case of WLANs. This is attributable to uncertainties associated with the amount of traffic, congestion, and other unpredictable factors within the network. Furthermore, unknown characteristics (e.g., type of application and load size) of the incoming loads cause the CEs to exhibit fluctuations in runtime processing speeds. Earlier work by our group has shown that LB policies that do not account for the delay randomness may perform poorly in practical distributed-computing settings where random delays are present [7]. For example, if nodes have dated, inaccurate information about the state of other nodes, due to random communication delays between nodes, then this could result in unnecessary periodic exchange of loads among them. Consequently, certain nodes may become idle while loads are in transit, a condition that would result in prolonging the total completion time of a load.

Generally, the performance of LB in delay-infested environments depends upon the selection of balancing instants as well as the level of load-exchange allowed between nodes. For example, if the network delay is negligible within the context of a certain application, the best performance is achieved by allowing every node to send all its excess load (e.g., relative to the average load per node in the system) to less-occupied nodes. On the other hand, in the extreme case for which the network delays are excessively large, it would be more prudent to reduce the amount of load exchange so as to avoid time wasted while loads are in transit.
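The trade-off between these two extremes can be illustrated with a small sketch. It is not the model analyzed in this paper: it assumes only two nodes, a single transfer at time 0, a fixed (deterministic) transfer delay, and constant processing rates, whereas the paper treats delays and processing times as random. The function names `excess_load` and `completion_time` and the parameter `gain` (the fraction of the excess load actually sent) are hypothetical, chosen for illustration.

```python
def excess_load(loads, node):
    """Load of `node` above the system-wide average load (never negative)."""
    average = sum(loads) / len(loads)
    return max(loads[node] - average, 0.0)


def completion_time(loads, rates, delay, gain):
    """Overall completion time of a two-node system after one transfer.

    Assumptions (illustrative only): node 0 is the more heavily loaded node,
    it ships gain * excess tasks to node 1 at time 0, and the shipped tasks
    arrive after a fixed `delay`. Rates are in tasks per unit time.
    """
    sent = gain * excess_load(loads, 0)
    finish_sender = (loads[0] - sent) / rates[0]
    finish_receiver = loads[1] / rates[1]
    if sent > 0:
        # The receiver can start the shipped tasks only after they have
        # arrived and its own queue has been drained.
        finish_receiver = max(finish_receiver, delay) + sent / rates[1]
    return max(finish_sender, finish_receiver)
```

Under these assumptions, with loads of 100 and 20 tasks and unit rates, sending the full excess at zero delay gives a completion time of 60, against 100 with no transfer. With a delay of 200 the full transfer takes 240, so sending nothing is better. With a delay of 50, sending half the excess (80) beats both sending everything (90) and sending nothing (100).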
Clearly, in a practical delay-limited distributed-computing setting, the amount of load to be exchanged lies between these two extremes and the amount of load-transfer has to be carefully chosen. A commonly used parameter that serves to control the intensity of load balancing is the LB gain.

In our earlier work [7], [8], we have shown that, for distributed systems with realistic random communication delays, limiting the number of balancing instants and optimizing the performance over the choice of the balancing times as well as the LB gain at each balancing instant can result in significant improvement in computing efficiency. This motivated us to look into the so-called one-shot LB strategy. In particular, once nodes are initially assigned a certain number of tasks, all nodes would together execute LB only at one prescribed instant [8].

IEEE Transactions on Parallel and Distributed Systems, vol. 18, no. 4, April 2007, p. 485.
S. Dhakal, M.M. Hayat, J.E. Pezoa, and C. Yang are with the Department of Electrical and Computer Engineering, University of New Mexico, Albuquerque, NM 87131-0001. E-mail: {dhakal, hayat, jpezoa, cundongyang}@eece.unm.edu.
D.A. Bader is with the College of Computing, Georgia Institute of Technology, Atlanta, GA 30332. E-mail: bader@cc.gatech.edu.
Manuscript received 17 Dec. 2005; revised 27 June 2006; accepted 6 July 2006; published online 9 Jan. 2007. Recommended for acceptance by R. Thakur. For information on obtaining reprints of this article, please send e-mail to: tpds@computer.org, and reference IEEECS Log Number TPDS-0508-1205. Digital Object Identifier no. 10.1109/TPDS.2007.1007.
1045-9219/07/$25.00 © 2007 IEEE. Published by the IEEE Computer Society.

Monte Carlo studies and real-time experiments conducted over WLAN confirmed our notion that, for a given initial load and average