The VLDB Journal (2009) 18:1141–1166
DOI 10.1007/s00778-009-0161-2
SPECIAL ISSUE PAPER
Creating probabilistic databases from duplicated data
Oktie Hassanzadeh · Renée J. Miller
Received: 14 September 2008 / Revised: 10 June 2009 / Accepted: 26 June 2009 / Published online: 20 August 2009
© Springer-Verlag 2009
Abstract A major source of uncertainty in databases is the
presence of duplicate items, i.e., records that refer to the same
real-world entity. However, accurate deduplication is a dif-
ficult task and imperfect data cleaning may result in loss of
valuable information. A reasonable alternative approach is
to keep duplicates when the correct cleaning strategy is not
certain, and utilize an efficient probabilistic query-answering
technique to return query results along with probabilities of
each answer being correct. In this paper, we present a flexible
modular framework for scalably creating a probabilistic data-
base out of a dirty relation of duplicated data and overview
the challenges raised in utilizing this framework for large
relations of string data. We study the problem of associating
probabilities with duplicates that are detected using state-
of-the-art scalable approximate join methods. We argue that
standard thresholding techniques are not sufficiently robust
for this task, and propose new clustering algorithms suitable
for inferring duplicates and their associated probabilities. We
show that the inferred probabilities accurately reflect the error
in duplicate records.
Keywords Probabilistic databases · Duplicate detection ·
String databases
Work supported in part by NSERC.
O. Hassanzadeh · R. J. Miller
Department of Computer Science, University of Toronto,
Toronto, Canada
e-mail: oktie@cs.toronto.edu
R. J. Miller
e-mail: miller@cs.toronto.edu
1 Introduction
The presence of duplicates is a major concern for the quality
of data in large databases. To detect duplicates, entity resolution, also known as duplicate detection or record linkage, is used as part of the data-cleaning process to identify records that potentially refer to the same entity. Numerous deduplication techniques exist to normalize data and remove erroneous records [42]. However, in many real-world applications
accurately merging duplicate records and fully eliminating erroneous duplicates remains a highly labor-intensive process. Furthermore, full deduplication may result in the loss of valuable information.
An alternative approach is to keep all the data and intro-
duce a notion of uncertainty for records that have been deter-
mined to potentially refer to the same entity. Such data would
naturally be inconsistent, containing sets of duplicate records.
Various methodologies exist with different characteristics for
managing uncertainty and inconsistency in data [2, 3, 15, 22,
51]. A large amount of previous work addresses the prob-
lem of efficient query evaluation on probabilistic databases
in which it is assumed that meaningful probability values are
assigned to the data in advance. Given these probabilities, a
query can return answers together with a probability of the
answer being correct, or alternatively return the top-k most
likely answers. For such approaches to work over duplicate
data, the record probabilities must accurately reflect the error
in the data.
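To make this concrete, the query-answering model described above can be sketched in a few lines of code. The snippet below is an illustrative toy, not the paper's system: the cluster contents, probability values, and function names are our own. Each set of potential duplicates is modeled as mutually exclusive alternatives carrying probabilities, and a selection query returns the probability that at least one true record satisfies it, under the common assumption that clusters are independent.

```python
# Illustrative sketch (data and names are hypothetical, not from the paper):
# each cluster of potential duplicates is a set of mutually exclusive
# alternatives, each paired with the probability that it is the true record.
clusters = [
    [("Thinkpad T43", 0.7), ("Thinkpad T43p", 0.3)],  # one uncertain entity
    [("Dell D600", 1.0)],                             # a clean record
]

def answer_probability(predicate):
    """Probability that at least one true record satisfies `predicate`,
    assuming clusters are independent and alternatives within a cluster
    are mutually exclusive."""
    p_no_match = 1.0
    for alternatives in clusters:
        # Within a cluster, alternative probabilities simply add up.
        p_cluster_match = sum(p for rec, p in alternatives if predicate(rec))
        p_no_match *= 1.0 - p_cluster_match
    return 1.0 - p_no_match

print(answer_probability(lambda rec: "T43p" in rec))  # ≈ 0.3
```

A query for "Thinkpad" would return probability 1.0 here, since the first cluster's alternatives are exhaustive; the quality of such answers clearly hinges on how well the assigned probabilities reflect the actual error in the data, which is the problem this paper addresses.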
To illustrate this problem, consider the dirty relations of
Fig. 1. To assign probabilities, we must first understand which
records are potential duplicates. For large data sets, a number
of scalable approximate join algorithms exist which return
pairs of similar records and their similarity scores (e.g., [4, 8,
38]). Given the result of an approximate join, we can group
records into sets of potential duplicates using a number of