Data Replication Approach with Consistency Guarantee for Data Grid
Jemal H. Abawajy, Senior Member, IEEE, and Mustafa Mat Deris, Member, IEEE
Abstract—Data grids have been adopted by many scientific communities that need to share, access, transport, process, and manage
geographically distributed large data collections. Data replication is one of the main mechanisms used in data grids whereby identical
copies of data are generated and stored at various distributed sites to improve data access performance, reliability, or both.
However, when data updates are allowed, it is a great challenge to simultaneously improve performance and reliability while ensuring data
consistency of such huge and widely distributed data. In this paper, we address this problem. We propose a new quorum-based data
replication protocol that aims to minimize data update cost while providing high availability and data consistency. We compare
the proposed approach with two existing approaches using response time, data consistency, data availability, and communication costs.
The results show that the proposed approach performs substantially better than the benchmark approaches.
Index Terms—Data grid, data replication, big data, reliability, availability, data consistency
1 INTRODUCTION
In an emerging class of data-intensive scientific and commercial applications such as high energy particle physics
and astronomy [3], large data sets may be generated, accessed, and shared from different locations with
varied quality of service requirements. The sheer volume of
data involved makes efficient data management an important
and challenging problem. Data grids such as the Large Had-
ron Collider (LHC) [3], the Enabling Grids for E-SciencE
project (EGEE) [2] and EU data grid project (EGI) [1] have
been developed to address these data management challenges.
However, management of such huge, widely distributed data gives
rise to many design issues such as fast and reliable access,
access permissions, data consistency, and security [32].
One practical way to address the problem of fast and reliable
data access is to use a data replication strategy in which multiple
copies of the data are stored at multiple remote sites. It has been
shown that even simple data replication provides substantial
performance improvements as compared to the case where no
data replication is used [14]. Although data replication tech-
niques have been widely studied in traditional distributed and
database systems (e.g., [9] and [14]), the scale and complexity of
applications and distributed computing architectures have
changed drastically, and so have replication protocols. Given
that the utility of many current network services is limited by
availability rather than raw performance, the problem of data
replication for improved performance and data availability is
of paramount importance in data grids.
Although data replication for data grids is gaining momentum, existing research has focused mainly on reducing data access latency by maintaining
replicas of a file in each data grid site. However, maintaining
replicas of a file in each site requires large storage and network
resources. Moreover, the algorithms for selection of candidate
sites to place replicas and for maintaining data consistency in
data grids are crucial to the success of the data replication
approaches [21]. Unfortunately, most of the existing data grid
replication schemes do not consider data updates, which makes them inappropriate for applications such as collaborative environments [26]. When data updates are allowed,
managing data access activities is very important in order to
preserve data consistency and reliability of the systems. Thus,
determining the number of replicas and the appropriate
locations to store the data replicas for performance and
availability while ensuring data consistency are major issues
to be addressed in data grids.
In this paper, we formulate the data replication problem
and design a distributed data replication algorithm with
consistency guarantee for data grids. The approach consists
of systematically organizing the data grid sites into distinct
regions, a new replica placement policy and a new quorum-
based replica management policy. Quorums serve as a basic tool for providing a uniform and reliable way to achieve consistency among the replicas of the system. The main advantage of quorum-based replication protocols is their resilience to node and network failures: any quorum of fully operational nodes can grant read and write permissions, improving the system's availability. In summary, we make the following main contributions:
1) A replica placement policy, which determines how
many replicas to create and where to place the replicas;
2) A replica consistency control policy, which determines
the level of consistency among data replicas;
3) An investigation of various tradeoffs in terms of cost, availability,
and algorithm complexity of the proposed replication
scheme; and
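The quorum intersection property underlying the proposed protocol can be illustrated with a minimal sketch. This is not the authors' algorithm, just a generic quorum read/write scheme under the standard assumption R + W > N; the class name `QuorumStore` and the parameters `n`, `r`, `w` are illustrative.

```python
import random

class QuorumStore:
    """Sketch of quorum-based replication over N replica sites.

    Choosing a read quorum size R and a write quorum size W with
    R + W > N guarantees that every read quorum intersects every
    write quorum, so a read always observes the most recent write.
    """

    def __init__(self, n=5, r=3, w=3):
        assert r + w > n, "quorums must intersect (R + W > N)"
        self.n, self.r, self.w = n, r, w
        # each replica stores a (version, value) pair
        self.replicas = [(0, None)] * n

    def write(self, value):
        # contact any W replicas and install the value under a new,
        # strictly higher version number
        version = max(v for v, _ in self.replicas) + 1
        for i in random.sample(range(self.n), self.w):
            self.replicas[i] = (version, value)

    def read(self):
        # contact any R replicas and return the value with the highest
        # version seen; intersection with the last write quorum
        # guarantees this is the latest value
        quorum = random.sample(range(self.n), self.r)
        _, value = max(self.replicas[i] for i in quorum)
        return value

store = QuorumStore()
store.write("replica-A")
print(store.read())
```

Because any 3-of-5 read quorum shares at least one site with the preceding 3-of-5 write quorum, the read above returns the latest value regardless of which replicas are sampled. It also shows why quorum schemes tolerate failures: any R (or W) operational sites suffice, so up to N - max(R, W) sites may be down.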
• J.H. Abawajy is with Deakin University, Geelong, Victoria 3220, Australia. E-mail: jemal@deakin.edu.au.
• M.M. Deris is with the Universiti Tun Hussein Onn, Batu Pahat 86400, Johor, Malaysia. E-mail: mmustafa@uthm.edu.my.
Manuscript received 03 Apr. 2012; revised 02 June 2013; accepted 02 Sep. 2013.
Date of publication 12 Sep. 2013; date of current version 12 Nov. 2014.
Recommended for acceptance by S. Ranka.
For information on obtaining reprints of this article, please send e-mail to:
reprints@ieee.org, and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TC.2013.183
IEEE TRANSACTIONS ON COMPUTERS, VOL. 63, NO. 12, DECEMBER 2014 2975
0018-9340 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.