Sieve: Linked Data Quality Assessment and Fusion

Pablo N. Mendes, Hannes Mühleisen, Christian Bizer
Web Based Systems Group
Freie Universität Berlin
Berlin, Germany, 14195
first.last@fu-berlin.de

ABSTRACT

The Web of Linked Data grows rapidly and already contains data originating from hundreds of data sources. The quality of data from those sources is very diverse, as values may be out of date, incomplete or incorrect. Moreover, data sources may provide conflicting values for a single real-world object. In order for Linked Data applications to consume data from this global data space in an integrated fashion, a number of challenges have to be overcome. One of these challenges is to rate and to integrate data based on their quality. However, quality is a very subjective matter, and finding a canonical judgement that is suitable for each and every task is not feasible.

To simplify the task of consuming high-quality data, we present Sieve, a framework for flexibly expressing quality assessment methods as well as fusion methods. Sieve is integrated into the Linked Data Integration Framework (LDIF), which handles Data Access, Schema Mapping and Identity Resolution, all crucial preliminaries for quality assessment and fusion.

We demonstrate Sieve in a data integration scenario importing data from the English and Portuguese versions of DBpedia, and discuss how we increase completeness, conciseness and consistency through the use of our framework.

Categories and Subject Descriptors

H.4 [Information Systems Applications]: Miscellaneous; H.2.5 [Information Systems]: Database Management—Heterogeneous databases

Keywords

Linked Data, RDF, Data Integration, Data Quality, Data Fusion, Semantic Web

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. LWDM 2012, March 26–30, 2012, Berlin, Germany. Copyright 2012 ACM 978-1-4503-0790-1/12/03 ...$10.00.

1. INTRODUCTION

The Web of Linked Data has seen an exponential growth over the past five years¹. From 12 Linked Data sets catalogued in 2007, the Linked Data cloud has grown to almost 300 data sets encompassing approximately 31 billion triples, according to the most recent survey conducted in September 2011 [10].

The information contained in each of these sources often overlaps. In fact, there are approximately 500 million explicit links between data sets [10], where each link indicates that one data set ‘talks about’ a data item from another data set. Further overlapping information may exist even though no explicit links have been established yet. For instance, two data sets may use different identifiers (URIs) for the same real-world object (e.g. Bill Clinton has an identifier in both the English and the Portuguese DBpedia). Similarly, two different attribute identifiers may be used for equivalent attributes (e.g. both foaf:name and dbprop:name contain the name ‘Bill Clinton’).

Applications that consume data from the Linked Data cloud are confronted with the challenge of obtaining a homogenized view of this global data space [8]. The Linked Data Integration Framework (LDIF) was created with the objective of supporting users in this task.
LDIF is able to conflate multiple identifiers of the same object into a canonical URI (identity resolution), while mapping equivalent attributes and class names into a homogeneous target representation (schema mapping).

As a result of such a data integration process, multiple values for the same attribute may be observed – e.g. originating from multiple sources. For attributes that only admit one value (e.g. the total area or population of a city), this represents a conflict for the consumer application to resolve. With the objective of supporting user applications in dealing with such conflicts, we created Sieve - Linked Data Quality Assessment and Data Fusion.

Sieve is included as a module in LDIF, and can be customized for user applications programmatically (through an open-source Scala API) and through configuration parameters that describe users' task-specific needs. Sieve includes a Quality Assessment module and a Data Fusion module. The Quality Assessment module leverages user-selected metadata as quality indicators to produce quality assessment scores through user-configured scoring functions. The Data Fusion module is able to use quality scores in order to perform user-configurable conflict resolution tasks.

In this paper we demonstrate Sieve through a data integration scenario involving the internationalized editions of DBpedia, which extracts structured data from Wikipedia.

¹ http://lod-cloud.net
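The interplay between quality scores and conflict resolution can be illustrated with a minimal sketch. This is not Sieve's actual Scala API; all names (candidates, quality_scores, fuse_highest_quality) and the example values and scores are hypothetical, and the fusion strategy shown is just one possible conflict-resolution policy: keep the value from the highest-rated source.

```python
# Hypothetical sketch of quality-based conflict resolution after data
# integration; names and values are illustrative, not Sieve's API.

# Conflicting values for a single-valued attribute (e.g. a river's total
# area) observed after integration, each paired with its source graph.
candidates = [
    ("7050000", "http://dbpedia.org"),     # English DBpedia
    ("6915000", "http://pt.dbpedia.org"),  # Portuguese DBpedia
]

# Quality scores as a user-configured assessment might produce them,
# e.g. from a recency-of-last-edit indicator scaled to [0, 1].
quality_scores = {
    "http://dbpedia.org": 0.9,
    "http://pt.dbpedia.org": 0.7,
}

def fuse_highest_quality(candidates, scores):
    """Resolve a conflict by keeping the value from the best-rated source."""
    value, _source = max(candidates, key=lambda vs: scores.get(vs[1], 0.0))
    return value

print(fuse_highest_quality(candidates, quality_scores))  # -> 7050000
```

Other policies (e.g. averaging numeric values, or taking the union for multi-valued attributes) fit the same shape: a function from scored candidates to a fused value.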