The VLDB Journal (2009) 18:1141–1166
DOI 10.1007/s00778-009-0161-2

SPECIAL ISSUE PAPER

Creating probabilistic databases from duplicated data

Oktie Hassanzadeh · Renée J. Miller

Received: 14 September 2008 / Revised: 10 June 2009 / Accepted: 26 June 2009 / Published online: 20 August 2009
© Springer-Verlag 2009

Abstract  A major source of uncertainty in databases is the presence of duplicate items, i.e., records that refer to the same real-world entity. However, accurate deduplication is a difficult task and imperfect data cleaning may result in loss of valuable information. A reasonable alternative approach is to keep duplicates when the correct cleaning strategy is not certain, and utilize an efficient probabilistic query-answering technique to return query results along with probabilities of each answer being correct. In this paper, we present a flexible modular framework for scalably creating a probabilistic database out of a dirty relation of duplicated data and overview the challenges raised in utilizing this framework for large relations of string data. We study the problem of associating probabilities with duplicates that are detected using state-of-the-art scalable approximate join methods. We argue that standard thresholding techniques are not sufficiently robust for this task, and propose new clustering algorithms suitable for inferring duplicates and their associated probabilities. We show that the inferred probabilities accurately reflect the error in duplicate records.

Keywords  Probabilistic databases · Duplicate detection · String databases

Work supported in part by NSERC.

O. Hassanzadeh · R. J. Miller
Department of Computer Science, University of Toronto, Toronto, Canada
e-mail: oktie@cs.toronto.edu
R. J. Miller e-mail: miller@cs.toronto.edu

1 Introduction

The presence of duplicates is a major concern for the quality of data in large databases.
To detect duplicates, entity resolution, also known as duplicate detection or record linkage, is used as a part of the data-cleaning process to identify records that potentially refer to the same entity. Numerous deduplication techniques exist to normalize data and remove erroneous records [42]. However, in many real-world applications, accurately merging duplicate records and fully eliminating erroneous duplicates is still a very labor-intensive process. Furthermore, full deduplication may result in the loss of valuable information.

An alternative approach is to keep all the data and introduce a notion of uncertainty for records that have been determined to potentially refer to the same entity. Such data would naturally be inconsistent, containing sets of duplicate records. Various methodologies exist with different characteristics for managing uncertainty and inconsistency in data [2, 3, 15, 22, 51]. A large amount of previous work addresses the problem of efficient query evaluation on probabilistic databases in which it is assumed that meaningful probability values are assigned to the data in advance. Given these probabilities, a query can return answers together with a probability of the answer being correct, or alternatively return the top-k most likely answers. For such approaches to work over duplicate data, the record probabilities must accurately reflect the error in the data.

To illustrate this problem, consider the dirty relations of Fig. 1. To assign probabilities, we must first understand which records are potential duplicates. For large data sets, a number of scalable approximate join algorithms exist which return pairs of similar records and their similarity scores (e.g., [4, 8, 38]). Given the result of an approximate join, we can group records into sets of potential duplicates using a number of
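As a point of reference, the simplest way to group the output of an approximate join into sets of potential duplicates is the standard thresholding baseline: keep only pairs whose similarity exceeds a cutoff, then take connected components (single-linkage transitive closure). The sketch below illustrates this baseline only; the record identifiers and scores are illustrative, and this is not the clustering algorithm proposed in this paper, which argues that plain thresholding is not sufficiently robust.

```python
def cluster_pairs(pairs, threshold):
    """Group record ids into potential-duplicate sets by taking
    connected components over pairs with similarity >= threshold.
    `pairs` is an iterable of (id_a, id_b, similarity) triples,
    as returned by a similarity/approximate join."""
    parent = {}  # union-find forest over record ids

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for a, b, sim in pairs:
        if sim >= threshold:
            union(a, b)
        else:
            find(a)  # still register both records as singletons
            find(b)

    clusters = {}
    for x in parent:
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())


# Illustrative output of an approximate join (hypothetical data):
pairs = [("r1", "r2", 0.92), ("r2", "r3", 0.88), ("r4", "r5", 0.35)]
print(cluster_pairs(pairs, threshold=0.8))
# r1, r2, r3 fall into one potential-duplicate set; r4 and r5 stay apart
```

Note how the result is sensitive to the threshold: lowering it to 0.3 would merge r4 and r5 as well, and transitive closure can chain weakly related records into one large cluster, which is one reason more robust clustering algorithms are needed.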