J. Parallel Distrib. Comput. 65 (2005) 1318–1328
www.elsevier.com/locate/jpdc

A path selection-based algorithm for real-time data staging in Grid applications

Mohammed Eltayeb a,*, Atakan Doğan b, Füsun Özgüner a

a Department of Electrical Engineering, The Ohio State University, 2015 Neil Avenue, Columbus, OH 43210, USA
b Department of Electrical and Electronics Engineering, Anadolu University, 26470 Eskişehir, Turkey

Received 5 January 2004; accepted 11 May 2005

* Corresponding author. Fax: +1 614 292 7596.
E-mail addresses: eltayeb@ee.eng.ohio-state.edu (M. Eltayeb), atdogan@anadolu.edu.tr (A. Doğan), ozguner@ee.eng.ohio-state.edu (F. Özgüner).
0743-7315/$ - see front matter. doi:10.1016/j.jpdc.2005.05.027

Abstract

Efficient data scheduling is becoming an important issue in distributed real-time applications that produce huge data sets. The Grid environment on which these applications may run seeks to harness geographically distributed resources for the applications. Scheduling components should account for the real-time measures of the applications and reduce the communication overhead caused by the enormous data sizes encountered, especially in dissemination applications. In this study, we consider the data staging scheme to provide the dissemination of large-scale data sets for distributed real-time applications. We propose a new path selection-based algorithm for optimizing a criterion that reflects the general satisfiability of the system. The algorithm adopts a blocking-time analysis method combined with a simple heuristic to explore the most likely regions of the search space. Two heuristics are provided for the algorithm to explore these regions. Simulation results show that the proposed algorithm, together with either of the heuristics, performs better than other algorithms in the literature. We also show by simulation that the new optimization criterion proposed in this study is successful in improving the performance of the individual applications.

© 2005 Elsevier Inc. All rights reserved.

Keywords: Data-intensive and real-time applications; Data staging; Data scheduling; Blocking analysis; Concurrency in scheduling

1. Introduction

A wide range of newly emerging large-scale distributed applications in the Grid are now moving toward operation under real-time constraints. The success of these applications depends not only on the successful completion of their massive tasks, but also on meeting specific pre-assigned deadlines or time constraints [12,7]. Moreover, the vast amount of data produced and processed by these applications poses a great challenge to the distributed or Grid infrastructure. In some cases, the data generated can reach such formidable sizes that the underlying distributed environment fails to satisfy its quality of service or real-time demands [6], which can significantly degrade the performance of the application. One solution is to continuously expand the network capacity in step with the growing demands on communication. However, such a solution can be prohibitively expensive or impractical due to the rapid increase in the size of emerging distributed applications [11,2]. A more subtle solution resorts to efficient and effective resource management and scheduling techniques that allow multi-class services to co-exist in such environments.

Let us consider the example of Fig. 1. Assume a distributed industrial vision system that provides complex and sensitive inspection for significant industrial production lines [13]. Vision equipment (VE) provides huge online images of the product, which need to be analyzed, matched, verified and stored in real time. For a huge industrial system composed of hundreds or perhaps thousands of participants, the Grid provides an efficient computation environment