Optimal Placement of Replicas in Data Grid Environments with Locality Assurance

Yi-Fang Lin, Pangfeng Liu
Department of Computer Science, National Taiwan University, Taipei, Taiwan, R.O.C.
pangfeng@csie.ntu.edu.tw

Jan-Jan Wu
Institute of Information Science, Academia Sinica, Taipei, Taiwan, R.O.C.

Abstract

Data replication is a typical strategy for increasing access performance and data availability in Data Grid systems. Current work on data replication in Grid systems focuses on infrastructure for replication and mechanisms for creating/deleting replicas. The important problem of choosing suitable locations for placing replicas in Data Grids has not been well studied. In this paper, we address the problem of data replica placement in Data Grids given the traffic pattern and locality requirements. We propose a new placement algorithm that finds the optimal locations for the replicas so that the workload among these replicas is balanced. We also propose a new algorithm to decide the minimum number of replicas required when the maximum workload capacity of each replica server is known. All these algorithms ensure that locality requirements from the users are satisfied.

1 Introduction

Grid computing is an important mechanism for utilizing distributed computing resources. These resources are distributed across different geographical locations, but are organized to provide an integrated service. A grid system can provide computing resources so that users at different locations can utilize the CPU cycles of remote sites. In addition, users can access important data that are available only in several locations, without the overhead of replicating them locally. These services are provided by an integrated grid service platform so that users can access the resources transparently and effectively.
One class of grid computing, and the focus of this paper, is Data Grids, which provide geographically distributed storage resources for large computational problems that require evaluating and managing large amounts of data [3, 11, 16]. For example, scientists working on bioinformatics may need to access human genome databases at different remote locations. These databases hold tremendous amounts of data, so maintaining a local copy at each site that needs the data is extremely expensive. In addition, these databases are mostly read-only, since they are the input data to applications for various purposes, such as benchmarking, identification, and classification. With the high latency of the wide-area networks that underlie most Grid systems, and the need to access and manage several petabytes of data in Grid environments, data availability and access optimization become key challenges to be addressed.

An important technique to speed up data access in Data Grid systems is to replicate the data in multiple locations, so that a user can access the data from a nearby site. It has been shown that data replication not only reduces access costs, but also increases data availability in many applications [11, 17, 15]. There is a fair amount of work on data replication in Grid environments. However, most of the existing work focuses on infrastructure for replication and mechanisms for creating/deleting replicas [4, 7, 6, 8, 11, 15, 18, 17, 19]. We believe that, in order to obtain the maximum gains from replication, a strategic placement of the replicas is necessary.

A number of early works address the placement of data replicas in parallel and distributed systems with regular network topologies such as hypercubes, tori, rings, and trees. These networks possess many attractive mathematical properties that enable the design of simple and robust placement algorithms [2, 12, 21].
These algorithms, however, cannot be directly applied to Data Grid systems due to hierarchical net-

Proceedings of the 12th International Conference on Parallel and Distributed Systems (ICPADS'06) 0-7695-2612-8/06 $20.00 © 2006 IEEE