DICE

Management of Very Large Distributed Shared Collections

Reagan W. Moore
San Diego Supercomputer Center
9500 Gilman Drive, MC-0505
La Jolla, CA 92093-0505
858-534-5073
moore@sdsc.edu

Keywords: Data grids, persistent archives, data virtualization, management virtualization

ABSTRACT

Large scientific collections may be managed as data grids for sharing data, digital libraries for publishing data, persistent archives for preserving data, or as real-time data repositories for sensor data. Despite the multiple types of data management objectives, it is possible to build each system from generic software infrastructure. This article examines the requirements driving the management of large data collections, the concepts on which current data management systems are based, and the current research initiatives for managing distributed data collections.

INTRODUCTION

Scientific data collections are being assembled that contain the digital holdings on which future research is based. The collections are assembled by researchers from multiple institutions, and then accessed by all members of a scientific discipline. The data collections are massive in size, comprising hundreds of terabytes of data (a terabyte is a thousand gigabytes) and tens of millions of files. The software infrastructure that manages these collections must provide not only traditional digital library services, such as indexing, discovery, and presentation, but also preservation services to ensure authenticity and integrity. The types of material in the collections range from digital simulation output generated by scientific applications, to observational data taken by experiments, to real-time sensor data streams from thousands of sensors.
Thus, the management of scientific data collections requires the integration of capabilities from multiple disparate communities: data grids for sharing data, digital libraries for publishing data, persistent archives for preserving data, and real-time sensor systems for automating the creation of collections.

The challenge is made more difficult by the fact that large data collections are inherently distributed. A collection may reside on multiple storage systems, with a copy on disk for interactive access and a backup copy on tape for long-term preservation. Its assembly may involve collaborators from multiple institutions, so that both the sources of the collection and its users are located at multiple sites. To mitigate the risk of data loss, data collections are distributed across multiple types of storage systems at geographically separate locations. For all of these reasons, scientific data collections must be built upon software systems that are capable of managing distributed data.