A Two-way Strategy for Replica Placement in Data Grid

Qaisar Rasool, Jianzhong Li, Ehsan Ullah Munir, and George S. Oreku
School of Computer Science and Technology, Harbin Institute of Technology, China

Abstract – In large Data Grid systems the main objective of replication is to enhance data availability by placing replicas in the proximity of users so that user-perceived response time is minimized. For a hierarchical Data Grid, replicas are usually placed in either a top-down or a bottom-up way. We put forward a Two-way replica placement scheme that places replicas of the most popular files close to the requesting clients and replicas of less popular files one tier below the Data Grid root. We allow data requests to be serviced by the sibling nodes as well as by the parent. Experimental results show the effectiveness of the Two-way replica placement scheme compared with no replication.

Keywords: Replication, Replica placement, Data Grid.

1 Introduction

Grid computing [5] is a wide-area distributed computing environment that involves large-scale resource sharing among collaborations, often referred to as Virtual Organizations, of individuals or institutes located in geographically dispersed areas. Data Grids [2] are grid infrastructures with specific needs to transfer and manage massive amounts of scientific data for analysis purposes.

Data replication is an important technique used in distributed systems for improving data availability and fault tolerance. Replication schemes are divided into static and dynamic. While static replication is user-centered and does not adapt to the changing behavior of the system, dynamic replication is more suitable for environments like P2P and Grid systems. In general, a replication mechanism determines which files should be replicated, when to create new replicas, and where the new replicas should be placed.

Many techniques have been proposed in the research literature for dynamic replication in the Grid [10, 7, 11, 13].
These strategies differ in the assumptions made regarding the underlying grid topology, user request patterns, dataset sizes and their distribution, and storage node capacities. Other distinctive features include the data request path and the manner in which replicas are placed on the Grid nodes.

Two common approaches for replica placement in a tree-topology Data Grid are top-down [10, 7] and bottom-up [11]. In both cases, the root of the Data Grid tree is considered the central repository for all datasets to be replicated. In a Data Grid tree, data requests are usually generated by the clients at the leaf nodes. A request travels from the client to its parent node in search of a replica and keeps moving upward until it reaches the root node.

In this paper we propose a Two-way replication scheme that takes a different path for data requests. It is assumed that the children under the same parent in the Data Grid tree are linked in a P2P-like manner. For any client’s request, if the desired data is not available at the client’s parent node, the request moves to the sibling nodes one by one until it finds the required data. If none of the siblings can fulfill the request, the request moves to the parent node one level up. Here too all the siblings are probed, and if the data is not found the request moves to the next parent and ultimately to the root node.

In the Two-way replication scheme we use both the bottom-up and the top-down approach to place data replicas in order to enhance the availability of requested data in the Data Grid. Files that are requested more frequently are placed close to the clients, and less frequently requested files are placed close to the root, one tier below it. The simulation studies show the benefit of the Two-way replication strategy over the case in which no replication is used. We perform experiments with data files of uniform size and with variable sizes separately.

2 Data Grid Model

Several Grid activities such as [3, 8] have been launched since the early years of this century.
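As an illustration of the request path proposed in the Introduction, the following Python sketch walks a request from a client up the tree, probing the siblings at each level before moving one level up. The class `GridNode`, the function `locate_replica`, the node names and the file identifiers are hypothetical and serve only this example. The sketch also assumes that the siblings probed are those of the node currently being examined and that they are visited in child-list order, neither of which the text specifies.

```python
from typing import List, Optional, Set


class GridNode:
    """A node of the hierarchical Data Grid tree (hypothetical helper class)."""

    def __init__(self, name: str, parent: Optional["GridNode"] = None) -> None:
        self.name = name
        self.parent = parent
        self.children: List["GridNode"] = []
        self.files: Set[str] = set()  # identifiers of the files stored at this node
        if parent is not None:
            parent.children.append(self)


def locate_replica(client: GridNode, file_id: str) -> Optional[GridNode]:
    """Return the first node on the assumed Two-way request path holding file_id.

    Path: the client's parent, then that node's siblings one by one,
    then the node one level up and its siblings, and so on up to the root.
    """
    node = client.parent
    while node is not None:
        if file_id in node.files:
            return node
        if node.parent is not None:
            # Assumption: siblings are probed in the order they were attached.
            for sibling in node.parent.children:
                if sibling is not node and file_id in sibling.files:
                    return sibling
        node = node.parent  # no hit at this level, move one level up
    return None  # only reachable if the root does not hold the file


# Illustrative topology: the root is the central repository for all files.
root = GridNode("root")
root.files = {"f1", "f2", "f3"}
region_a = GridNode("region-a", root)
region_b = GridNode("region-b", root)
cache_a1 = GridNode("cache-a1", region_a)
cache_a2 = GridNode("cache-a2", region_a)
client = GridNode("client-1", cache_a1)

cache_a2.files.add("f2")  # replica held by a sibling of the client's parent
region_b.files.add("f3")  # replica held one tier below the root

for file_id in ("f1", "f2", "f3"):
    print(file_id, "->", locate_replica(client, file_id).name)
```

In this example the request for `f2` is served by a sibling of the client's parent, the request for `f3` by a sibling one level higher, and only the request for `f1` has to travel all the way to the root.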
We find that many practical Grids, for example GriPhyN [12], employ a topology that is hierarchical in nature. The High Energy Physics (HEP) community seeks to take advantage of grid technology to provide physicists with access to real as well as simulated LHC [8] data from their home institutes. Data replication and management is hence considered to be one of the most important aspects of HEP Data Grids.

In this paper we use the hierarchical Data Grid model. A tree T is used to represent the topology of the Data Grid, which is composed of the root, intermediate nodes and leaf nodes. We hereafter refer to the intermediate nodes as cache nodes and to the leaf nodes as client nodes. All client nodes are local sites issuing requests for data stored at the root or cache nodes of the Data Grid. For any parent node, all its children are linked in a P2P-like manner (i.e. are siblings) and can transfer replicas to each other when required. The only