Dynamic Indexing for Incremental Entity Resolution in Data
Integration Systems
Priscilla Kelly M. Vieira¹,², Bernadette Farias Lóscio¹ and Ana Carolina Salgado¹
¹ Federal University of Pernambuco, Center of Informatics, Recife, Pernambuco, Brazil
² Federal Rural University of Pernambuco, Recife, Pernambuco, Brazil
Keywords: Data Integration, Entity Resolution, Data Matching, Duplicate Detection, Indexing.
Abstract: Entity Resolution (ER) is the problem of identifying groups of tuples from one or multiple data sources that
represent the same real-world entity. This is a crucial stage of data integration processes, which often need to
integrate data at query time. This task becomes even more challenging in scenarios with dynamic data sources
or with a large volume of data. As most ER techniques deal with all tuples at once, new solutions have been
proposed to deal with large volumes of data. One possible approach consists of performing the ER process on
query results rather than on the whole data set. It is also possible to reuse previous results of ER tasks in order to
reduce the number of comparisons between pairs of tuples at query time. Similarly, indexing techniques
can be employed to help identify equivalent tuples and to reduce the number of comparisons
between pairs of tuples. In this context, this work proposes an indexing technique for incremental Entity
Resolution processes. The expected contributions of this work are the specification, the implementation and
the evaluation of the proposed indexes. We performed experiments in which the time spent storing,
accessing and updating the indexes was measured. We concluded that reusing previous results makes the ER process more
efficient than recomparing the tuples, while producing results of similar quality.
1 INTRODUCTION
In recent years, companies and government
organizations around the world have increased their
production of digital data. In general, these data are
stored in multiple data sources, which can be
heterogeneous and dynamic. To access and analyze
these data in a uniform and integrated fashion, data
integration strategies are needed. The aim of data
integration is to combine heterogeneous and
autonomous data sources for providing a single view
to the user (Gruenheid et al., 2014). One of the main
steps of the data integration process is Entity
Resolution (ER) (Christen, 2012).
The ER process aims to identify tuples from one
or multiple data sources referring to the same real-
world entity. This problem has been the focus of
several works (Christen, 2012) and it is known by a
variety of names: Record Linkage, Entity Resolution,
Object Reference, Reference Linkage, Duplicate
Detection or Deduplication. In this paper, we adopt
the term Entity Resolution (Christen, 2012).
Given a large volume of data, ER can be a very
costly and time-consuming process. In general, the
most expensive task of the ER process is
tuple pair comparison, which requires comparing
every pair of tuples to calculate the corresponding
similarity. To reduce costs, ER can be performed in
an incremental way. In this case, just a subset of the
available tuples, i.e., an increment, is processed and
compared at each iteration of the ER process.
Additionally, results of previous iterations can be
reused during the comparison of new tuples. In this
way, the set of classified tuples grows
incrementally, which reduces the cost of the overall ER
process.
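To make the cost argument concrete: comparing every pair among n tuples takes n(n-1)/2 comparisons, so reusing earlier decisions matters as increments overlap. The following is a minimal sketch of incremental ER with reuse, not the technique proposed in this paper. The class `IncrementalResolver`, the `similarity` function (based on Python's `difflib`), the 0.8 threshold and the sample tuples are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, FrozenSet, Set


def similarity(a: str, b: str) -> float:
    """Hypothetical similarity measure: case-insensitive string ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


class IncrementalResolver:
    """Illustrative resolver that caches pairwise match decisions across increments."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold  # assumed match threshold
        self.decisions: Dict[FrozenSet[str], bool] = {}  # pair of ids -> match?
        self.comparisons = 0  # similarity computations actually performed

    def resolve(self, increment: Dict[str, str]) -> Set[FrozenSet[str]]:
        """Return the matching id pairs within one increment (e.g. a query result)."""
        matches: Set[FrozenSet[str]] = set()
        for id_a, id_b in combinations(sorted(increment), 2):
            pair = frozenset((id_a, id_b))
            if pair not in self.decisions:
                # Only pairs never seen before are compared.
                self.comparisons += 1
                score = similarity(increment[id_a], increment[id_b])
                self.decisions[pair] = score >= self.threshold
            if self.decisions[pair]:
                matches.add(pair)
        return matches


resolver = IncrementalResolver()
first = resolver.resolve({"t1": "John Smith", "t2": "Jon Smith", "t3": "Mary Jones"})
after_first = resolver.comparisons
second = resolver.resolve({"t1": "John Smith", "t2": "Jon Smith", "t4": "Maria Jones"})
after_second = resolver.comparisons
```

In this sketch the first increment needs all three comparisons, while the second increment only compares the pairs involving the new tuple `t4`, because the decision for (`t1`, `t2`) is reused.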
In this paper, we focus on an incremental ER
approach over query results. This means that the
increment is the query result and the ER should be
performed at query execution time. Given that we are
dealing with large volumes of data, performing the
ER at query time is even more challenging. Among
the solutions proposed in the literature to deal with
this challenge, we are interested in the use of
indexing techniques (Christen, 2012).
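As a generic illustration of how an index can cut down comparisons, the sketch below uses standard blocking: tuples are grouped by a key, and only tuples that share a key become candidate pairs. This is not the dynamic index proposed in this paper. The functions `blocking_key`, `build_index` and `candidate_pairs`, the choice of key, and the sample tuples are assumptions made for illustration, and names are assumed to be non-empty.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple


def blocking_key(name: str) -> str:
    """Hypothetical blocking key: first three letters of the last token, lower-cased."""
    return name.split()[-1][:3].lower()


def build_index(tuples: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each blocking key to the ids of the tuples that share it."""
    index: Dict[str, List[str]] = defaultdict(list)
    for tuple_id, name in tuples.items():
        index[blocking_key(name)].append(tuple_id)
    return index


def candidate_pairs(index: Dict[str, List[str]]) -> Set[Tuple[str, str]]:
    """Only tuples inside the same block are paired for comparison."""
    pairs: Set[Tuple[str, str]] = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


sample = {
    "t1": "John Smith",
    "t2": "Jon Smith",
    "t3": "Mary Jones",
    "t4": "Maria Jones",
}
block_index = build_index(sample)
pairs_to_compare = candidate_pairs(block_index)
```

With these four sample tuples, an exhaustive comparison would examine six pairs, whereas the blocks yield two candidate pairs. The trade-off is that a poorly chosen key can place true matches in different blocks, so they are never compared.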
To reduce the costs of performing ER at query
execution time, we propose a dynamic indexing
technique. The dynamic indexes are available in main
memory, reducing the costs of disk access, and can be
Vieira, P., Lóscio, B. and Salgado, A.
Dynamic Indexing for Incremental Entity Resolution in Data Integration Systems.
DOI: 10.5220/0006251801850192
In Proceedings of the 19th International Conference on Enterprise Information Systems (ICEIS 2017) - Volume 1, pages 185-192
ISBN: 978-989-758-247-9
Copyright © 2017 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved