Optimal Placement of Replicas in Data Grid Environments with
Locality Assurance
Yi-Fang Lin Pangfeng Liu
Department of Computer Science
National Taiwan University
Taipei, Taiwan, R.O.C.
pangfeng@csie.ntu.edu.tw
Jan-Jan Wu
Institute of Information Science
Academia Sinica
Taipei, Taiwan, R.O.C.
Abstract
Data replication is a typical strategy for increasing
access performance and data availability in Data
Grid systems. Current work on data replication in Grid
systems focuses on infrastructure for replication and
mechanisms for creating/deleting replicas. The impor-
tant problem of choosing suitable locations for placing
replicas in Data Grids has not been well studied.
In this paper, we address the problem of data replica
placement in Data Grids given the traffic pattern and
locality requirements. We propose a new placement al-
gorithm that finds the optimal locations for the replicas
so that the workload among these replicas is balanced.
We also propose a new algorithm to decide the mini-
mum number of replicas required when the maximum
workload capacity of each replica server is known. All
these algorithms ensure that locality requirements from
the users are satisfied.
1 Introduction
Grid computing is an important mechanism for
utilizing distributed computing resources. These re-
sources are distributed in different geographical loca-
tions, but are organized to provide an integrated ser-
vice. A grid system can provide computing resources
so that users at different locations can utilize the CPU
cycles of remote sites. In addition, users can access
important data that are available only in several lo-
cations, without the overheads of replicating them lo-
cally. These services are provided by an integrated grid
service platform so that users can access the resources
transparently and effectively.
One class of grid computing, and the focus of
this paper, is Data Grids, which provide geographically
distributed storage resources to large computational
problems that require evaluating and managing large
amounts of data [3, 11, 16]. For example, scientists
working on bioinformatics may need to access human
genome databases at different remote locations. These
databases contain tremendous amounts of data, so the cost
of maintaining a local copy at each site that needs
the data is extremely high. In addition, these
databases are mostly read-only, since they are the input
data to the applications for various purposes, such as
benchmarking, identification, and classification. With
the high latency of the wide-area networks that underlie
most Grid systems, and the need to access and manage
several petabytes of data in Grid environments, data
availability and access optimization become key
challenges to be addressed.
An important technique to speed up data access for
Data Grid systems is to replicate the data in multiple
locations, so that a user can access the data from a
nearby site. It has been shown that data replication
not only reduces access costs, but also increases data
availability in many applications [11, 17, 15]. There is
a fair amount of work on data replication in Grid
environments. However, most of the existing work focuses
on infrastructure for replication and mechanisms for
creating/deleting replicas [4, 7, 6, 8, 11, 15, 18, 17, 19].
We believe that, to obtain the maximum gain
from replication, strategic placement of the replicas is
necessary.
A number of early works address placement of data
replicas in parallel and distributed systems with regular
network topologies such as hypercubes, tori,
rings, and trees. These networks possess many attractive
mathematical properties that enable the design of
simple and robust placement algorithms [2, 12, 21].
These algorithms, however, cannot be directly ap-
plied to Data Grid systems due to hierarchical net-
Proceedings of the 12th International Conference on Parallel and Distributed Systems (ICPADS'06)
0-7695-2612-8/06 $20.00 © 2006 IEEE