Future Generation Computer Systems 23 (2007) 846–860
www.elsevier.com/locate/fgcs

Job scheduling and data replication on data grids

Ruay-Shiung Chang, Jih-Sheng Chang, Shin-Yi Lin

Department of Computer Science and Information Engineering, National Dong Hwa University, Shoufeng, Hualien 974, Taiwan

Received 25 July 2006; received in revised form 23 February 2007; accepted 27 February 2007
Available online 16 March 2007

Corresponding author. Tel.: +886 3 8632031; fax: +886 3 8632030. E-mail address: rschang@mail.ndhu.edu.tw (R.-S. Chang).

Abstract

In data grids, distributed scientific and engineering applications often require access to large amounts of data (terabytes or petabytes). Data access time depends on bandwidth, especially in a cluster grid, where network bandwidth within a cluster is higher than across clusters. In a communication environment, the major bottleneck to fast data access in grids is the high latency of Wide Area Networks (WANs) and the Internet. Effective scheduling in such a network architecture can reduce the amount of data transferred across the Internet by dispatching a job to a site where the needed data are present. Another solution is to use a data replication mechanism that creates multiple copies of existing data, reducing the number of accesses to remote sites. To exploit both ideas, we develop a job scheduling policy, called HCS (Hierarchical Cluster Scheduling), and a dynamic data replication strategy, called HRS (Hierarchical Replication Strategy), to improve data access efficiency in a cluster grid. We evaluate our algorithms by simulation under various combinations of data access patterns. We also implement HCS and HRS in the Taiwan Unigrid environment. The simulation and experimental results show that HCS and HRS reduce data access time and the amount of inter-cluster communication compared with other strategies in a cluster grid.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Data replication; Data grid; Job scheduling

1. Introduction

In data grids [1,2], distributed scientific and engineering applications often require access to large amounts of data (terabytes or petabytes). Managing this much data in a centralized way is ineffective because of the high access latency and the load on the central server. Hence, such huge datasets must be partitioned and stored in several physical locations. In a communication environment, the performance of accessing a large, distributed body of data depends on the available network bandwidth. In particular, slow data access can throttle the performance of data-intensive applications running on grid computers.

Fig. 1 shows a simple hierarchical form of a grid system, called a cluster grid. A cluster represents an organizational unit: a group of sites that are geographically close. We define two kinds of communication between sites in a cluster grid. Intra-communication is the communication between sites within the same cluster. Inter-communication, on the other hand, is the communication between sites across clusters. Network bandwidth between sites within a cluster is higher than across clusters. Therefore, to reduce access latency and to avoid the WAN bandwidth bottleneck in a cluster grid, it is important to reduce the number of inter-communications.

To address this problem, we consider two aspects of inter-communication: job scheduling and the replication mechanism. Consider a case in which many authorized users submit jobs to solve data-intensive problems. We want these jobs to be executed as fast as possible. The data used on a data grid range in size from terabytes to petabytes. Scheduling jobs to suitable grid sites is necessary because data movement between different grid sites is time consuming. Scheduling decisions should be based on the resources available at each grid site.
Other factors to consider include CPU workload, computational capability, data location, and network load. If a job is scheduled to a site where the required data are present, the job can process the data at that site without the transmission delay of fetching them from a remote site.

0167-739X/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.future.2007.02.008

Data replication is another important optimization for managing large datasets: copies of the data are placed in geographically distributed data stores. Previous replication strategies show