SAMGrid Peer-to-Peer Information Service

Matthew Leslie 1,2, Siniša Veseli 2
1 Oxford University Computing Laboratory
2 Fermi National Accelerator Laboratory
m.leslie1@physics.ox.ac.uk, veseli@fnal.gov

Abstract

SAMGrid presently relies on a centralised database to provide several services vital to the operation of the system. These services are all encapsulated in the SAMGrid Database Server, and include access to file metadata and replica catalogs, dataset and processing bookkeeping, as well as runtime support for the SAMGrid station services. Access to the centralised database and DB Servers represents a single point of failure in the system and limits its scalability. To address this issue, we have created a prototype of a peer-to-peer information service that allows the system to operate when access to the central DB is unavailable for any reason (e.g., network failures or scheduled downtimes), and that improves system performance during periods of extremely high load, when central DB access is slow and/or has a high failure rate. Our prototype uses Distributed Hash Tables to create a fault tolerant and self-healing service. We believe that this is the first peer-to-peer information service designed to become part of an in-use grid system.

We describe here the prototype architecture and its existing and planned functionality, and show how it can be integrated into the SAMGrid system. We also present a study of the performance of our new service under different circumstances. Our results strongly demonstrate the feasibility and usefulness of the proposed architecture.

INTRODUCTION

The high energy physics community places stringent demands on its data handling systems. Experiments such as MINOS, and the D0[2] and CDF[1] detectors at Fermilab, generate petabytes of data which must be stored and made available to physicists for analysis[3].
This is made more challenging by the international nature of the collaborations that analyse this data. The CDF experiment, for instance, has collaborators in 11 countries on three continents.

To meet these demands, the SAM-Grid[4] system has evolved to be both robust and fault tolerant. However, like many grid systems, it relies heavily on central services. While some of these services can easily be configured to fail over to (possibly off-site) backups, no such possibility exists for the central database, which stores all information about the SAM-Grid system. This reliance on a single database creates two problems: a load bottleneck, which limits scalability, and a single point of failure, which limits fault tolerance. Here we describe efforts to reduce this dependency by deploying a scalable and fault tolerant peer-to-peer information service.

We first give a brief overview of the existing SAM-Grid information service and describe why we feel a peer-to-peer replacement is appropriate. We then describe the recent advances in peer-to-peer software that power our new system, and how we incorporate them into SAM-Grid. Next, we investigate the performance of our implementation of this architecture. Finally, we discuss the context of this work and offer concluding remarks.

EXISTING SAM-GRID ARCHITECTURE

The SAM-Grid system offers a wide variety of services for data transfer, cataloguing, data storage and process bookkeeping in a distributed environment. SAM-Grid users can create datasets of physics data files based on metadata attributes, then use the SAM system to manage the delivery and processing of these files, and finally the storage of the results.

Two of the main system components are the Station and the DBServer. It is the station that requests and logs the delivery of files to user projects, recording which files are stored on which disks, and managing storage space.
To record and retrieve this data from a persistent store, the station uses CORBA method calls to communicate with the DBServer. All SAM-Grid information is stored in the central database, so the DBServer must translate these requests into SQL and pass them on. The results the database returns are then processed and returned to the station. The DBServer hides the underlying database schema from the stations, and provides a level of indirection between the station and the database that we have exploited in our information service architecture.

MOTIVATION

Although it is possible to run more than one DBServer, the SAM-Grid design does not allow for more than one database. This limits both the scalability and the fault tolerance of the entire system. Though the Oracle database has proven reliable, network outages can still bring all off-site processing to a halt. As eighty percent of the 50 stations currently running as part of the D0 experiment are hosted