DICE RODS

A Prototype Rule-based Distributed Data Management System

Arcot Rajasekar, Mike Wan, Reagan Moore, Wayne Schroeder
San Diego Supercomputer Center
University of California, San Diego
{sekar,mwan,moore,schroede}@sdsc.edu

Abstract

Current distributed data management systems implement internal consistency constraints directly within the software. If these constraints change over time, the software must be rewritten. A new approach to data management, based on the dynamic execution of rules, is discussed. By building upon the experience gained with the Storage Resource Broker data grid, a scalable rule-based data management system can be designed.

1. Introduction

Data grids support data virtualization, the ability to manage the properties of, and access to, shared collections distributed across multiple storage systems [1]. The data virtualization concept maps the physical aspects of the storage systems to a logical view of the information and data that is presented to users and applications. Shared collections are used to organize distributed data for data grids (data sharing) [2], digital libraries (data publication) [3], and persistent archives (data preservation) [4]. The current state of the art in data abstraction, as represented by the SDSC Storage Resource Broker (SRB) [5] architecture, relies upon the following multiple levels of virtualization:

Storage access virtualization – this level of virtualization provides a single uniform interface for mapping between access APIs and the protocols used by different types of storage systems (e.g., Unix file systems, Windows FAT32, and other custom-created file systems such as SamQFS). Distributed file systems usually support a limited number of operating system types: IBM's GPFS on Unix [6], Gfarm on Linux systems [7]. Data grids provide uniform access to data stored in file systems, databases, archival storage systems, file transfer protocols, sensor data loggers, and object ring buffers [8].
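The idea behind storage access virtualization can be sketched as a driver interface: client code is written against one abstract API, and each storage protocol is wrapped in a driver that implements it. The sketch below is illustrative only; the class and method names are hypothetical and do not reflect the SRB's actual driver API.

```python
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Hypothetical uniform interface that a data grid maps onto each
    storage protocol (file system, archive, object ring buffer, ...)."""

    @abstractmethod
    def read(self, path: str) -> bytes:
        """Return the bytes stored at the given logical path."""

    @abstractmethod
    def write(self, path: str, data: bytes) -> None:
        """Store bytes at the given logical path."""


class InMemoryDriver(StorageDriver):
    """Stand-in for a concrete back end (e.g., a Unix file system driver)."""

    def __init__(self):
        self.objects = {}

    def read(self, path: str) -> bytes:
        return self.objects[path]

    def write(self, path: str, data: bytes) -> None:
        self.objects[path] = data


def replicate(src: StorageDriver, src_path: str,
              dst: StorageDriver, dst_path: str) -> None:
    """Client code works against the interface, not any specific protocol,
    so the same call can copy data between unlike storage systems."""
    dst.write(dst_path, src.read(src_path))
```

Because `replicate` depends only on the abstract interface, adding support for a new storage system means writing one new driver, not changing every client, which is the property the paper attributes to this level of virtualization.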
Naming virtualization – this level provides naming transparency for shared collections, including logical data set naming, logical resource naming, and logical user naming. It provides for access to and organization of data independent of their location, as well as authentication and authorization controls that do not depend upon the storage repositories. The Grid File System working group of the Global Grid Forum is exploring virtualization of file names for file systems [9]. The SRB virtualizes naming across all types of storage systems [10].

Process and workflow virtualization – this level provides a virtualization of the operations performed on data sets, either remotely or through the use of grid computing services to execute workflows across multiple compute platforms. It automates scheduling, data movement, and application execution on remote, and possibly multiple, computational platforms [11]. The Kepler workflow system uses actors to define allowed operations and controls interactions between actors [12]. Kepler actors have been developed that manage access to SRB data collections.

Information virtualization – this level provides a way to associate metadata with the data. Multiple types of metadata may be specified, from simple attribute-value-unit triplets to semi-structured metadata. The gLite Metadata Catalog associates metadata with entities that are defined by globally unique identifiers [13]. The SRB system associates metadata directly with the logical file name space [14].

The data virtualization/abstraction discussed above still does not provide complete transparency for data and information management. The additional level that is needed is “policy/constraint virtualization”, which provides virtualization at the data management level, as opposed to the data and application (workflow) levels. The development of a constraint-based