J. Parallel Distrib. Comput. 65 (2005) 1318 – 1328
www.elsevier.com/locate/jpdc
A path selection-based algorithm for real-time data staging
in Grid applications
Mohammed Eltayeb a,∗, Atakan Doğan b, Füsun Özgüner a
a Department of Electrical Engineering, The Ohio State University, 2015 Neil Avenue, Columbus, OH 43210, USA
b Department of Electrical and Electronics Engineering, Anadolu University, 26470 Eskişehir, Turkey
Received 5 January 2004; accepted 11 May 2005
Abstract
Efficient data scheduling is becoming an important issue in distributed real-time applications that produce huge data sets. The Grid
environment on which these applications may run seeks to harness the geographically distributed resources for the applications. Scheduling
components should account for real-time measures of the applications and reduce communication overhead due to enormous data size
experienced, especially in dissemination applications. In this study, we consider the data staging scheme to provide the dissemination of
large-scale data sets for the distributed real-time applications. We propose a new path selection-based algorithm for optimizing a criterion
that reflects the general satisfiability of the system. The algorithm adopts a blocking-time analysis method combined with a simple heuristic
to explore the most likely regions of a search space. Two heuristics are provided for the algorithm to explore these regions of the search
space. Simulation results show that the proposed algorithm, combined with either heuristic, outperforms other
algorithms in the literature. We also show by simulation that the new optimization criterion proposed in this study succeeds in
improving the performance of the individual applications.
© 2005 Elsevier Inc. All rights reserved.
Keywords: Data-intensive and real-time applications; Data staging; Data scheduling; Blocking analysis; Concurrency in scheduling
1. Introduction
A wide range of newly emerging large-scale distributed
applications in the Grid is now moving toward operation
under real-time constraints. The success of these applications
depends not only on the successful completion of their
massive tasks, but also on meeting specific pre-assigned
deadlines or time constraints [12,7]. Moreover, the vast
amount of data produced and processed by these applications
poses a great challenge to the distributed or Grid
infrastructure. In some cases, the data generated can reach
such formidable sizes that the underlying distributed
environment fails to satisfy its quality of service or
real-time demands [6], which can significantly degrade the
performance of the application. One solution is to keep
expanding the network capacity in step with the growing
communication demands. However, such a solution can be
prohibitively expensive or impractical due to the rapid
increase in the size of emerging distributed applications
[11,2]. A more subtle solution resorts to efficient and
effective resource management and scheduling techniques that
allow multi-class services to co-exist in such environments.
∗ Corresponding author. Fax: +1 614 292 7596.
E-mail addresses: eltayeb@ee.eng.ohio-state.edu (M. Eltayeb),
atdogan@anadolu.edu.tr (A. Doğan), ozguner@ee.eng.ohio-state.edu
(F. Özgüner).
0743-7315/$ - see front matter © 2005 Elsevier Inc. All rights reserved.
doi:10.1016/j.jpdc.2005.05.027
Let us consider the example of Fig. 1. Assume a distributed
industrial vision system that provides complex and sensitive
inspection for significant industrial production lines [13].
Vision equipment (VE) provides huge online images of the
product, which need to be analyzed, matched, verified, and
stored in real time. For a huge industrial system composed
of hundreds or even thousands of participants, the Grid
provides an efficient computation environment