Data Replication Approach with Consistency Guarantee for Data Grid

Jemal H. Abawajy, Senior Member, IEEE, and Mustafa Mat Deris, Member, IEEE

Abstract—Data grids have been adopted by many scientific communities that need to share, access, transport, process, and manage geographically distributed large data collections. Data replication is one of the main mechanisms used in data grids, whereby identical copies of data are generated and stored at various distributed sites to improve data access performance, reliability, or both. However, when data updates are allowed, it is a great challenge to simultaneously improve performance and reliability while ensuring the consistency of such huge and widely distributed data. In this paper, we address this problem. We propose a new quorum-based data replication protocol with the objectives of minimizing the data update cost while providing high availability and data consistency. We compare the proposed approach with two existing approaches in terms of response time, data consistency, data availability, and communication cost. The results show that the proposed approach performs substantially better than the benchmark approaches.

Index Terms—Data grid, data replication, big data, reliability, availability, data consistency

1 INTRODUCTION

In an emerging class of data-intensive scientific and commercial applications, such as high-energy particle physics and astronomy [3], large amounts of data may be generated, accessed, and shared from different locations with varied quality-of-service requirements. The sheer volume of data involved makes efficient data management an important and challenging problem. Data grids such as the Large Hadron Collider (LHC) [3], the Enabling Grids for E-SciencE project (EGEE) [2], and the EU data grid project (EGI) [1] have been developed to address these data management challenges.
However, the management of widely distributed huge data gives rise to many design issues, such as fast and reliable access, access permissions, data consistency, and security [32]. One practical way to address the problem of fast and reliable data access is to use a data replication strategy, in which multiple copies of the data are stored at multiple remote sites. It has been shown that even a simple data replication scheme provides substantial performance improvements over the case where no data replication is used [14]. Although data replication techniques have been widely studied in traditional distributed and database systems (e.g., [9] and [14]), the scale and complexity of applications and distributed computing architectures have changed drastically, and so have replication protocols. Given that the utility of many current network services is limited by availability rather than raw performance, the problem of data replication for improved performance and data availability is of paramount importance in data grids.

Although data replication for data grids is gaining momentum, existing research has focused primarily on reducing data access latency by maintaining replicas of a file at each data grid site. However, maintaining replicas of a file at each site requires large storage and network resources. Moreover, the algorithms for selecting candidate sites at which to place replicas and for maintaining data consistency are crucial to the success of data replication approaches [21]. Unfortunately, most existing data grid replication schemes do not consider data updates, which makes them inappropriate for applications such as collaborative environments [26]. When data updates are allowed, managing data access activities is very important in order to preserve data consistency and the reliability of the system.
Thus, determining the number of replicas and the appropriate locations at which to store them for performance and availability, while ensuring data consistency, are major issues to be addressed in data grids. In this paper, we formulate the data replication problem and design a distributed data replication algorithm with a consistency guarantee for data grids. The approach consists of systematically organizing the data grid sites into distinct regions, a new replica placement policy, and a new quorum-based replica management policy. The quorum serves as a basic tool for providing a uniform and reliable way to achieve consistency among the replicas of the system. The main advantage of quorum-based replication protocols is their resilience to node and network failures: any quorum of fully operational nodes can grant read and write permissions, improving the system's availability. In summary, we make the following main contributions:

1) A replica placement policy, which determines how many replicas to create and where to place them;
2) A replica consistency control policy, which determines the level of consistency among data replicas;
3) An investigation of various tradeoffs in terms of cost, availability, and algorithm complexity of the proposed replication scheme; and

J.H. Abawajy is with Deakin University, Geelong, Victoria 3220, Australia. E-mail: jemal@deakin.edu.au.
M.M. Deris is with the Universiti Tun Hussein Onn, Batu Pahat 86400, Johor, Malaysia. E-mail: mmustafa@uthm.edu.my.

Manuscript received 03 Apr. 2012; revised 02 June 2013; accepted 02 Sep. 2013. Date of publication 12 Sep. 2013; date of current version 12 Nov. 2014. Recommended for acceptance by S. Ranka. For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org, and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TC.2013.183

IEEE TRANSACTIONS ON COMPUTERS, VOL. 63, NO. 12, DECEMBER 2014. 0018-9340 © 2013 IEEE.
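To make the quorum intersection idea concrete, the following is a minimal sketch of a generic versioned read/write quorum protocol of the kind the introduction alludes to. This is an illustration only, not the specific protocol proposed in the paper; the class and function names, the use of version numbers, and the quorum sizes are all assumptions. The key property it demonstrates is that choosing read and write quorum sizes R and W over N replicas with R + W > N forces every read quorum to intersect every write quorum, so a read always sees the most recently written value.

```python
# Sketch of a generic majority-style quorum protocol (NOT the paper's protocol).
import random

class Replica:
    """One replica site, holding a value tagged with a version number."""
    def __init__(self):
        self.version = 0
        self.value = None

def write(replicas, w, value):
    """Write to a quorum of w replicas, tagging the value with a version
    one higher than any version seen in the quorum."""
    quorum = random.sample(replicas, w)
    new_version = max(r.version for r in quorum) + 1
    for r in quorum:
        r.version = new_version
        r.value = value

def read(replicas, r_size):
    """Read from a quorum of r_size replicas and return the value
    carrying the highest version number."""
    quorum = random.sample(replicas, r_size)
    latest = max(quorum, key=lambda r: r.version)
    return latest.value

# With N = 5 replicas, R = W = 3 satisfies R + W > N, so every read
# quorum overlaps every write quorum in at least one replica.
N = 5
replicas = [Replica() for _ in range(N)]
W, R = 3, 3
write(replicas, W, "v1")
write(replicas, W, "v2")
print(read(replicas, R))  # prints v2
```

Note the availability/cost tradeoff this exposes: smaller read quorums make reads cheaper and more fault-tolerant but force larger (more expensive) write quorums, and vice versa, which is one of the tradeoffs the contributions above set out to investigate.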