A Query-driven and Incremental Process for Entity Resolution

Priscilla Kelly M. Viera 1,2, Ana Carolina Salgado 1, Bernadette Farias Lóscio 1

1 Federal University of Pernambuco, Center for Informatics, Recife, Pernambuco, Brazil
{pkmv, acs, bfl}@cin.ufpe.br
2 Federal Rural University of Pernambuco, Recife, Pernambuco, Brazil

1 Introduction

Companies and governmental organizations around the world publish a huge volume of data, which can be stored in multiple data sources. To access and analyze these data, strategies for data integration are needed. The aim of data integration is to combine heterogeneous and autonomous data sources in order to provide a single view to the user [1]. An important component of the data integration process is the Entity Resolution (ER) task [2]. The goal of ER is to identify tuples that refer to the same real-world entity (in this work, "tuple" is synonymous with "instance" and "record"). This problem is known by a variety of names: Record Linkage, Entity Resolution, Object Reference, Reference Linkage, Duplicate Detection or Deduplication. In this paper, we adopt the term Entity Resolution (ER) [2].

Often, companies and organizations have to deal with dynamic data sources that hold a large volume of data. In this context, the ER process can be very challenging because most currently available ER techniques process all the entities at one time [3]. This occurs because most of these techniques are based on batch algorithms, which resolve all tuples instead of resolving only those related to a single query [4, 5, 6]. Hence, new techniques are needed to support real-time ER for dynamic and large databases.

For example, consider a set of data sources containing bibliographic data and a query that retrieves all papers by a given author (e.g. "Getoor"). To answer this query, it is not necessary to look at other authors' papers or to perform ER over the whole set of papers.
In this case, it is better to focus only on the tuples that describe papers by the author specified in the query.

In this paper, we propose a QUery-Driven and Incremental process for Entity Resolution (QuID). The QuID process operates on query results obtained from multiple data sources. It is incremental: the ER clusters produced while processing a query result are kept and reused when answering later queries. In our approach, ER is treated as a clustering problem [7], in which each cluster contains the tuples of a single real-world entity. During ER, the results of queries are analyzed, and each tuple of a query result is inserted incrementally into a cluster. Our solution maintains an index over the tuples and performs incremental clustering, producing clusters of tuples that refer to the same real-world entity.

The rest of the paper is organized as follows. In Section 2 we discuss related work. In Section 3 we formally define the problem and describe the QuID process. In Section 4 we conclude.
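
To make the query-driven, incremental idea outlined in this introduction more concrete, the following is a minimal sketch and not the QuID algorithm itself, which is defined in Section 3. Everything in it is an assumption made for illustration: the class name IncrementalResolver, the "id" and "title" fields, the blocking key (first three characters of the title), the string similarity from Python's difflib, and the 0.85 threshold. It only shows the general pattern: tuples arrive as query results, an index narrows the candidate clusters, each tuple is inserted into an existing cluster or opens a new one, and the clusters persist so that later queries can reuse them.

```python
from difflib import SequenceMatcher


class IncrementalResolver:
    """Hypothetical sketch: clusters persist across queries and are reused."""

    def __init__(self, threshold=0.85):
        self.threshold = threshold  # assumed similarity cut-off
        self.clusters = []          # clusters[i] = tuples of one real-world entity
        self.index = {}             # blocking key -> ids of candidate clusters
        self.seen = {}              # tuple id -> cluster id (already resolved)

    @staticmethod
    def _key(record):
        # Assumed blocking key: first three characters of the lower-cased title.
        return record["title"].lower()[:3]

    @staticmethod
    def _similarity(a, b):
        # Assumed similarity measure: difflib ratio over lower-cased titles.
        return SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()

    def add(self, record):
        """Insert one tuple into a matching cluster, or open a new cluster."""
        if record["id"] in self.seen:  # resolved by an earlier query: reuse
            return self.seen[record["id"]]
        key = self._key(record)
        for cid in self.index.get(key, []):
            if any(self._similarity(record, other) >= self.threshold
                   for other in self.clusters[cid]):
                self.clusters[cid].append(record)
                self.seen[record["id"]] = cid
                return cid
        cid = len(self.clusters)
        self.clusters.append([record])
        self.index.setdefault(key, []).append(cid)
        self.seen[record["id"]] = cid
        return cid

    def resolve(self, query_result):
        """Resolve only the tuples of one query result; return their clusters."""
        cluster_ids = sorted({self.add(r) for r in query_result})
        return [self.clusters[c] for c in cluster_ids]


resolver = IncrementalResolver()

# First query result: two sources return the same paper with slightly different titles.
first = resolver.resolve([
    {"id": "s1:1", "title": "Entity Resolution in Graphs"},
    {"id": "s2:7", "title": "Entity resolution in graphs."},
])

# Second query result: one tuple was already resolved, one is new.
second = resolver.resolve([
    {"id": "s1:1", "title": "Entity Resolution in Graphs"},
    {"id": "s3:4", "title": "Collective Classification"},
])
```

In this sketch the second call does not re-compare the tuple "s1:1"; it looks up the cluster assigned during the first call, which is the kind of reuse the incremental process aims for. A real implementation would need a more robust blocking key and similarity function than the ones assumed here.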