The VLDB Journal (2005) 0–25
DOI 10.1007/s00778-005-0156-6

REGULAR PAPER

Deepavali Bhagwat · Laura Chiticariu · Wang-Chiew Tan · Gaurav Vijayvargiya

An annotation management system for relational databases

Received: 30 November 2004 / Revised version: 12 April 2005 / Published online: 25 October 2005
© Springer-Verlag 2005

Abstract We present an annotation management system for relational databases. In this system, every piece of data in a relation is assumed to have zero or more annotations associated with it and annotations are propagated along, from the source to the output, as data is being transformed through a query. Such an annotation management system could be used for understanding the provenance (aka lineage) of data, who has seen or edited a piece of data or the quality of data, which are useful functionalities for applications that deal with integration of scientific and biological data. We present an extension, pSQL, of a fragment of SQL that has three different types of annotation propagation schemes, each useful for different purposes. The default scheme propagates annotations according to where data is copied from. The default-all scheme propagates annotations according to where data is copied from among all equivalent formulations of a given query. The custom scheme allows a user to specify how annotations should propagate. We present a storage scheme for the annotations and describe algorithms for translating a pSQL query under each propagation scheme into one or more SQL queries that would correctly retrieve the relevant annotations according to the specified propagation scheme. For the default-all scheme, we also show how we generate finitely many queries that can simulate the annotation propagation behavior of the set of all equivalent queries, which is possibly infinite. The algorithms are implemented and the feasibility of the system is demonstrated by a set of experiments that we have conducted.
Keywords Data provenance · Lineage · Annotation propagation · Metadata

D. Bhagwat · L. Chiticariu (✉) · W.-C. Tan · G. Vijayvargiya
Department of Computer Science, University of California, Santa Cruz, 1156 High Street, Santa Cruz, CA 95064, USA
E-mail: laura@cs.ucsc.edu

1 Introduction

For many scientific domains, new databases are often created to support the data analysis needs of domain-specific scientists. Some examples of such databases from biology include UniProt [1] and SWISS-PROT [2]. Data that is collected from other sources is often cleansed and reformatted before it is compiled into a new database. Furthermore, it is common for such newly created databases to contain new analysis or results that are derived by scientists. By associating old and new data together in the new database, an integrated perspective is provided to scientists and this is critical for further analysis and scientific discovery.

Very often, there is information about data that is not kept in the database but one would like to propagate this information along as data is being moved around. Examples include information about the perceived accuracy or reliability of experimental results by domain experts, or information about who has seen or edited a piece of data. In fact, our initial motivation for the design of a system that can propagate additional information around is to propagate the provenance of data items along as data is being copied. With the proliferation of many such interdependent databases (see [3] for a catalog of biology databases), it is natural to ask what is the provenance of a piece of data (i.e., where that piece of data is copied or created from) in a database. Understanding the provenance of data is important towards understanding the quality of data which may help, for example, a scientist to decide on the amount of trust to place on a piece of information that she encounters in a database.
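To make the idea of carrying provenance along as data is copied more concrete, the following is a minimal sketch in Python, not the pSQL system itself. It assumes a toy in-memory model in which each cell holds a value and a set of annotations, and a simple selection and projection copies cells (together with their annotations) into the output. The names `Cell`, `annotate`, `select_project`, and `source_db`, as well as the `table.column@row` annotation format, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """A single column value of a tuple together with its annotations."""
    value: object
    annotations: frozenset = frozenset()


def annotate(table_name, rows):
    """Attach one provenance annotation 'table.column@row' to every cell.

    The annotation format is an assumption made for this sketch.
    """
    return [
        {col: Cell(val, frozenset({f"{table_name}.{col}@{i}"}))
         for col, val in row.items()}
        for i, row in enumerate(rows)
    ]


def select_project(rows, columns, predicate=lambda values: True):
    """Keep rows whose plain values satisfy `predicate`, then copy `columns`.

    Each output cell is the very cell it was copied from, so its
    annotations travel with it; columns that are only tested by the
    predicate contribute no annotations to the output.
    """
    out = []
    for row in rows:
        values = {col: cell.value for col, cell in row.items()}
        if predicate(values):
            out.append({col: row[col] for col in columns})
    return out


# Hypothetical source relation with two tuples.
proteins = annotate("source_db", [
    {"id": "P1", "name": "kinase"},
    {"id": "P2", "name": "ligase"},
])

# Roughly: SELECT name FROM proteins WHERE id = 'P2'
result = select_project(proteins, ["name"], lambda v: v["id"] == "P2")
for row in result:
    for col, cell in row.items():
        print(col, cell.value, sorted(cell.annotations))
```

In this sketch the output cell for `name` carries only the annotation of the source cell it was copied from; the `id` cell is used in the selection condition but is not copied, so its annotation does not appear in the result.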
We describe an annotation management system for relational databases where every column of every tuple in every relation can be annotated with zero or more annotations. We use the term annotation to mean information about data such as provenance, comments, or other types of metadata. The annotations are automatically propagated along as data is being transformed through a query. In its default behavior, our system propagates annotations based on where data is copied from. As a consequence, if every column of every tuple in a database is annotated with