Dynamic Indexing for Incremental Entity Resolution in Data Integration Systems

Priscilla Kelly M. Vieira 1,2, Bernadette Farias Lóscio 1 and Ana Carolina Salgado 1
1 Federal University of Pernambuco, Center of Informatics, Recife, Pernambuco, Brazil
2 Federal Rural University of Pernambuco, Recife, Pernambuco, Brazil

Keywords: Data Integration, Entity Resolution, Data Matching, Duplicate Detection, Indexing.

Abstract: Entity Resolution (ER) is the problem of identifying groups of tuples from one or multiple data sources that represent the same real-world entity. This is a crucial stage of data integration processes, which often need to integrate data at query time. The task becomes even more challenging in scenarios with dynamic data sources or with a large volume of data. As most ER techniques deal with all tuples at once, new solutions have been proposed to deal with large volumes of data. One possible approach consists of performing the ER process on query results rather than on the whole data set. It is also possible to reuse previous results of ER tasks in order to reduce the number of comparisons between pairs of tuples at query time. Similarly, indexing techniques can be employed to help identify equivalent tuples and to reduce the number of comparisons between pairs of tuples. In this context, this work proposes an indexing technique for incremental Entity Resolution processes. The expected contributions of this work are the specification, the implementation and the evaluation of the proposed indexes. We performed experiments that measured the time spent storing, accessing and updating the indexes. We concluded that reusing previous results makes the ER process more efficient than recomputing the tuple comparisons, with similar quality of results.

1 INTRODUCTION

In recent years, companies and government organizations around the world have increased their production of digital data.
In general, these data are stored in multiple data sources, which can be heterogeneous and dynamic. To access and analyze these data in a uniform and integrated fashion, data integration strategies are needed. The aim of data integration is to combine heterogeneous and autonomous data sources in order to provide a single view to the user (Gruenheid et al., 2014). One of the main steps of the data integration process is Entity Resolution (ER) (Christen, 2012). The ER process aims to identify tuples from one or multiple data sources that refer to the same real-world entity. This problem has been the focus of several works (Christen, 2012) and is known by a variety of names: Record Linkage, Entity Resolution, Object Reference, Reference Linkage, Duplicate Detection or Deduplication. In this paper, we adopt the term Entity Resolution (Christen, 2012).

Given a large volume of data, ER can be a very costly and time-consuming process. In general, the most expensive task of the ER process is tuple pair comparison, which requires comparing every pair of tuples to calculate the corresponding similarity. To reduce costs, ER can be performed in an incremental way. In this case, just a subset of the available tuples, i.e., an increment, is processed and compared at each iteration of the ER process. Additionally, results of previous iterations can be reused during the comparison of new tuples. In this way, the volume of classified tuples grows incrementally, reducing the cost of the overall ER process.

In this paper, we focus on an incremental ER approach over query results. This means that the increment is the query result and that ER should be performed at query execution time. Given that we are dealing with large volumes of data, performing ER at query time is even more challenging. Among the solutions proposed in the literature to deal with this challenge, we are interested in the use of indexing techniques (Christen, 2012).
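The idea of incremental ER over query results with reuse of earlier comparisons can be illustrated with a minimal Python sketch. It is not the indexing technique proposed in this paper: the `PairCache` class, the `resolve` function, the string-similarity measure (`difflib.SequenceMatcher`) and the 0.85 threshold are all assumptions chosen only for illustration. Each query result is treated as an increment, and a pair of tuples is compared only if its decision is not already stored in memory.

```python
from difflib import SequenceMatcher
from itertools import combinations


class PairCache:
    """Hypothetical in-memory store of earlier match decisions, keyed by tuple-id pair."""

    def __init__(self, threshold=0.85):
        self.threshold = threshold  # assumed similarity cut-off, not taken from the paper
        self.decisions = {}         # frozenset({id_a, id_b}) -> bool
        self.comparisons = 0        # similarity computations actually performed

    def is_match(self, id_a, value_a, id_b, value_b):
        key = frozenset((id_a, id_b))
        if key not in self.decisions:
            # Pair not seen in a previous iteration: compute and remember the decision.
            self.comparisons += 1
            score = SequenceMatcher(None, value_a.lower(), value_b.lower()).ratio()
            self.decisions[key] = score >= self.threshold
        return self.decisions[key]


def resolve(query_result, cache):
    """Return the matching id pairs within one increment (a query result)."""
    matches = []
    for (id_a, val_a), (id_b, val_b) in combinations(sorted(query_result.items()), 2):
        if cache.is_match(id_a, val_a, id_b, val_b):
            matches.append((id_a, id_b))
    return matches


cache = PairCache()
first_query = {1: "John Smith", 2: "Jon Smith", 3: "Maria Souza"}
second_query = {1: "John Smith", 2: "Jon Smith", 4: "Maria Sousa"}
first_matches = resolve(first_query, cache)    # all three pairs are compared
second_matches = resolve(second_query, cache)  # pair (1, 2) is reused from the cache
```

In this sketch the second query triggers only two new similarity computations, because the decision for tuples 1 and 2 was stored during the first query. A production system would also need to invalidate stored decisions when a source tuple changes, which the sketch does not attempt.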
Vieira, P., Lóscio, B. and Salgado, A. Dynamic Indexing for Incremental Entity Resolution in Data Integration Systems. DOI: 10.5220/0006251801850192. In Proceedings of the 19th International Conference on Enterprise Information Systems (ICEIS 2017) - Volume 1, pages 185-192. ISBN: 978-989-758-247-9. Copyright © 2017 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved.

To reduce the costs of performing ER at query execution time, we propose a dynamic indexing technique. The dynamic indexes are available in main memory, reducing the costs of disk access, and can be