A Scalable Autonomous Replica Management Framework for Grids

Vladimir Vlassov, Dong Li
Dept of Electronic, Computer and Software Systems, School for Information and Communication Technology, Royal Institute of Technology (KTH), Stockholm, Sweden
vladv@kth.se, dongl@kth.se

Konstantin Popov
Swedish Institute of Computer Science (SICS), Kista, Sweden
kost@sics.se

Seif Haridi
Dept of Electronic, Computer and Software Systems, School for Information and Communication Technology, Royal Institute of Technology (KTH), Stockholm, Sweden
seif@kth.se

Abstract

Data replication can reduce access time and improve fault tolerance and load balancing. Typical requirements for a replica management system include an upper bound on replica Round Trip Time, scalability, reliability, self-management and self-organization, and the ability to maintain consistency of mutable replicas. This article presents the design and a prototype implementation of a scalable, autonomous, service-oriented replica management framework for Globus Toolkit Version 4 using DKS. DKS is a structured peer-to-peer middleware. Grid nodes are integrated into a P2P network. The framework uses the ant metaphor and techniques of multi-agent systems for collaborative replica selection. We also propose a complementary "background" service that collects access statistics and optimizes replica placement based on access patterns and replica lifetime statistics. We have tested and profiled the prototype.

1. Introduction

The Grid aims at secure integration of heterogeneous computing resources in a standard and uniform way. Computational Grids provide sharing of computing facilities, while Data Grids support applications that address large amounts of data that can be distributed and replicated over the globe. Managing storage facilities and data replicas in large-scale and volatile Data Grids is a challenging task.
A replica management system determines the locations of replicas, keeps track of them, and maintains replica consistency. It should be scalable, optimize replica placement in order to reduce access time and communication costs, and maintain a required level of consistency of mutable replicas [1]. Typical replica management systems (e.g. [2][3]) consist of:

- Information services such as the replica location service (RLS), user access history, and a metadata catalog. Some services, in particular RLS, must provide good scalability and fault tolerance.
- A data transfer service that provides file transfer among collaborating Grid nodes.
- A security infrastructure, including authentication and authorization for remote users, and secure communication and data transfer.
- A data consistency management service, whose requirements depend on the application and its data access patterns. In many scientific applications the datasets are read-only, so replicas are automatically consistent. Applications with mutable data should be allowed to choose an optimal level of replica consistency.

This research work is carried out under the FP6 Network of Excellence CoreGRID funded by the European Commission (Contract IST-2002-004265).

Our research aims at building a scalable higher-level replica management system that provides optimized replica selection and placement. Our system prototype is implemented using the Globus Toolkit 4 (GT4) [4]. Our framework utilizes the DKS (Distributed K-ary System) peer-to-peer middleware [5][6], which allows Grid nodes to be organized into self-organizing P2P overlay networks. The GT4 GridFTP [7] is used for data transfer. Our framework contains two specially developed components: the DKS Replica Location Service (DKSRLS), built on the DKS P2P network, and the Node Location Component (NLC), based on the GT4 WS Monitoring and Discovery System (MDS) aggregator framework. The framework also provides a data consistency mechanism.
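The two basic duties described above, keeping track of replica locations and selecting a replica within an upper bound on Round Trip Time, can be illustrated with a minimal sketch. It is an in-memory toy and not the DKSRLS or any GT4 interface: the class `ReplicaCatalog`, its methods, the logical file names, and the URLs are all hypothetical names introduced here for illustration, and a real replica location service would distribute this mapping over many nodes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ReplicaCatalog:
    """Toy in-memory stand-in for a replica location service (hypothetical)."""

    # logical file name -> {physical replica URL: last measured RTT in ms}
    entries: Dict[str, Dict[str, float]] = field(default_factory=dict)

    def register(self, lfn: str, url: str, rtt_ms: float) -> None:
        """Record (or update) a physical replica of a logical file."""
        self.entries.setdefault(lfn, {})[url] = rtt_ms

    def unregister(self, lfn: str, url: str) -> None:
        """Forget a replica; drop the logical name once no replicas remain."""
        replicas = self.entries.get(lfn, {})
        replicas.pop(url, None)
        if not replicas:
            self.entries.pop(lfn, None)

    def locate(self, lfn: str) -> List[str]:
        """Return all known replica URLs for a logical file, sorted by URL."""
        return sorted(self.entries.get(lfn, {}))

    def select(self, lfn: str, max_rtt_ms: float) -> Optional[str]:
        """Return the lowest-RTT replica within the bound, or None if none fits."""
        within_bound = {
            url: rtt
            for url, rtt in self.entries.get(lfn, {}).items()
            if rtt <= max_rtt_ms
        }
        if not within_bound:
            return None
        return min(within_bound, key=within_bound.get)


if __name__ == "__main__":
    catalog = ReplicaCatalog()
    catalog.register("lfn://data/run1", "gsiftp://a.example.org/run1", 80.0)
    catalog.register("lfn://data/run1", "gsiftp://b.example.org/run1", 25.0)
    print(catalog.locate("lfn://data/run1"))
    print(catalog.select("lfn://data/run1", max_rtt_ms=50.0))
```

When `select` returns `None`, no existing replica satisfies the bound. In a sketch like this, that result would be the natural trigger for creating a new replica closer to the requester.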
IEEE John Vincent Atanasoff 2006 International Symposium on Modern Computing (JVA'06) 0-7695-2643-8/06 $20.00 © 2006

The dynamicity of the Grid environment presents challenges for replica management. In order to simplify replica management and consistency maintenance for Grid users and administrators, and to improve reliability, a replica management system must be