Discovering Correlations in Annotated Databases

Xuebin He, Stephen Donohue, Mohamed Y. Eltabakh
Worcester Polytechnic Institute, Computer Science Department, MA, USA
{xhe2, donohues, meltabakh}@cs.wpi.edu

ABSTRACT

Most emerging applications, especially in science domains, maintain databases that are rich in metadata and annotation information, e.g., auxiliary exchanged comments, related articles and images, provenance information, corrections and versioning information, and even scientists’ thoughts and observations. To manage these annotated databases, numerous techniques have been proposed to extend DBMSs and efficiently integrate the annotations into the data processing cycle, e.g., storage, indexing, extended query languages and semantics, and query optimization. In this paper, we address a new facet of annotation management: the discovery and exploitation of the hidden correlations that may exist in annotated databases. Such correlations can be either between the data and the annotations (data-to-annotation), or between the annotations themselves (annotation-to-annotation). We make the case that these annotation-related correlations can be exploited in various ways to enhance the quality of the annotated database, e.g., discovering missing attachments and recommending annotations for newly inserted data. We leverage the state of the art in association rule mining in innovative ways to discover the annotation-related correlations, and we propose several extensions to it that address new challenges and cases specific to annotated databases, i.e., incremental addition of annotations, and hierarchy-based annotations. The proposed algorithms are evaluated using real-world applications from the biological domain, and an end-to-end system including an Excel-based GUI is developed for seamless manipulation of the annotations and their correlations.

1. INTRODUCTION

Most modern applications annotate and curate their data with various types of metadata information, usually called annotations, e.g., provenance information, versioning timestamps, execution statistics, related comments or articles, corrections and conflict-related information, and auxiliary exchanged knowledge from different users. Interestingly, the number and size of these annotations are growing very fast, e.g., the number of annotations is around 30x, 120x, and 250x larger than the number of data records in the DataBank biological database [3], the Hydrologic Earth database [4, 47], and the AKN ornithological database [5], respectively. Existing techniques in annotation management, e.g., [9, 15, 17, 21, 24], have made it feasible to systematically capture such metadata annotations and efficiently integrate them into the data processing cycle. This includes propagating the related annotations along with queries’ answers [9, 15, 17, 24, 46], querying the data based on their attached annotations [21, 24], and supporting semantic annotations such as provenance tracking [11, 14, 20, 43] and belief annotations [23]. Such integration is very beneficial to higher-level applications, as it complements the base data with the auxiliary and semantically rich source of annotations.

© 2016, Copyright is with the authors. Published in Proc. 19th International Conference on Extending Database Technology (EDBT), March 15-18, 2016 - Bordeaux, France: ISBN 978-3-89318-070-7, on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0

In this paper, we address a new facet of annotation management that has received little attention and has not been addressed by existing techniques. This facet concerns the discovery and exploitation of the hidden correlations that may exist in annotated databases.
Given the growing scale of annotated databases, in both the base data and the annotation sets, important correlations may exist either between the data values and the annotations, i.e., data-to-annotation correlations, or among the annotations themselves, i.e., annotation-to-annotation correlations. By systematically discovering such correlations, applications can leverage them in various ways, as motivated by the following scenarios.

Motivation Scenario 1 (Discovery of Missing Attachments): Assume the example biological database illustrated in Figure 1. Typically, many biologists annotate subsets of the data over time, and each scientist focuses on only a few genes of interest at a time. For example, some of the data records in Figure 1 are annotated with a “Black Flag” annotation. This annotation may represent a scientific article or a comment that is attached to these tuples. By analyzing the data, we observe that most genes having value F1 in the Family column have an attached “Black Flag” annotation. Such a correlation suggests that gene JW0012 is probably missing this annotation, e.g., none of the biologists was working on that gene and thus the article did not get attached to it. By discovering this correlation, the system can proactively learn and recommend the missing attachment to domain experts for verification. Correlations may also exist among the annotations themselves, e.g., between the “Black Flag” and the “Red Flag” annotations. Without discovering such correlations, the database may become “under annotated” due to these missing attachments.

Motivation Scenario 2 (Annotation Maintenance under Evolving Data): Data is always evolving, and new records are always added to the database. Hence, a key question is: “For the newly added data records, do any of the existing annotations apply to them?” Learning the correlations between the data and the annotations can certainly help in answering such a question. For ex-