Parallel Meta-blocking for Scaling Entity Resolution over Big Heterogeneous Data

Vasilis Efthymiou 1, George Papadakis 2, George Papastefanatos 3, Kostas Stefanidis 4, Themis Palpanas 5
1 University of Crete, Greece & ICS-FORTH, Greece, vefthym@ics.forth.gr
2 University of Athens, Greece, gpapadis@di.uoa.gr
3 Athena Research Center, Greece, gpapas@imis.athena-innovation.gr
4 University of Tampere, Finland, kostas.stefanidis@uta.fi
5 Paris Descartes University, France, themis@mi.parisdescartes.fr

Abstract

Entity resolution constitutes a crucial task for many applications, but has an inherently quadratic complexity. To enable entity resolution to scale to large volumes of data, blocking is typically employed: it clusters similar entities into (overlapping) blocks, so that it suffices to perform comparisons only within each block. To further increase efficiency, Meta-blocking is used to clean the overlapping blocks of unnecessary comparisons, increasing precision by orders of magnitude at a small cost in recall. Despite its high time efficiency, though, using Meta-blocking in practice to solve the entity resolution problem on very large datasets is still challenging: applying it to 7.4 million entities takes (almost) 8 full days on a modern high-end server. In this paper, we introduce scalable algorithms for Meta-blocking that exploit the MapReduce framework. Specifically, we describe a parallel execution strategy that explicitly targets the core concept of Meta-blocking, the blocking graph. Furthermore, we propose two more advanced strategies that aim to reduce the overhead of data exchange. The comparison-based strategy creates the blocking graph implicitly, while the entity-based strategy is independent of the blocking graph, employing fewer MapReduce jobs with more elaborate processing. We also introduce a load balancing algorithm that distributes the computationally intensive workload evenly among the available compute nodes.
Our experimental analysis verifies the feasibility and superiority of our advanced strategies, and demonstrates their scalability to very large datasets.

Keywords: Meta-blocking, Map/Reduce Model, Parallelization

1. Introduction

Entity resolution (ER) is a very common task in Big Data processing, in which different entity profiles, usually described under different schemas, are mapped to the same real-world object. Beyond the deduplication and cleaning problems that appear in traditional data integration settings, such as data warehouses, ER is a prerequisite for many Web applications, posing several challenges due to the volume and variety of the data collections. In general, ER constitutes an inherently quadratic task: given an entity collection, each entity profile must be compared to all others. Several approaches aim to reduce the set of possible comparisons to be performed between two data collections [1]. Blocking is a typical method that reduces the number of pairwise comparisons by placing similar entity profiles into blocks and performing only the comparisons within each block. Redundancy, i.e., placing every entity into multiple blocks, is employed by most blocking methods that handle noisy data [2, 3]. In fact, most blocking methods are redundancy-positive [4, 5]: the more blocks two entities share, the more likely they are to match. As an example, consider the simple approach of Token Blocking [6], which creates a block for every token that appears in the attribute values of at least two entities. Applied to the entities in Figure 1(a), it yields the blocks in Figure 1(b), which place both pairs of matching entities, e1-e3 and e2-e4, in at least one common block. Thus, they can be detected, despite their noisy, heterogeneous descriptions.
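To make the idea concrete, Token Blocking can be sketched in a few lines. The entity profiles below are hypothetical illustrations (not the entities of Figure 1), and the function name and schema keys are our own; this is a minimal sketch, not the implementation evaluated in the paper:

```python
from collections import defaultdict

def token_blocking(entities):
    """Create a block for every token that appears in the
    attribute values of at least two entities."""
    blocks = defaultdict(set)
    for eid, profile in entities.items():
        for value in profile.values():
            for token in str(value).lower().split():
                blocks[token].add(eid)
    # a token shared by fewer than two entities yields no comparisons
    return {tok: ids for tok, ids in blocks.items() if len(ids) >= 2}

# hypothetical, schema-heterogeneous profiles: e1 and e3 describe
# the same person under different attribute names
entities = {
    "e1": {"name": "John Smith", "city": "Athens"},
    "e2": {"full_name": "Jane Doe"},
    "e3": {"title": "john smith"},
}
blocks = token_blocking(entities)
# tokens "john" and "smith" each produce the block {e1, e3},
# so the matching pair e1-e3 co-occurs despite the schema noise
```

Note the redundancy-positive behavior: the matching pair shares two blocks, while non-matching pairs share none.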
On the flip side, redundancy entails two kinds of unnecessary comparisons: the redundant ones repeatedly compare the same entity profiles in multiple blocks, while the superfluous ones compare profiles that do not match. For example, the blocks b2 and b4 in Figure 1(b) contain one redundant comparison each, repeated in b1 and b3, respectively; given that entities e1 and e2 match with e3 and e4, respectively, the blocks b5, b6, b7 and b8 contain superfluous comparisons (the only exception is the redundant comparison e3-e5 in b8, which is repeated in b6). In total, the blocks of Figure 1(b) involve 13 comparisons, of which 3 are redundant and 8 superfluous. Such comparisons increase the computational cost without contributing any identified duplicates.

Current state-of-the-art. Numerous studies have focused on the problem of block processing, whose goal is to discard unnecessary (both redundant and superfluous) comparisons in order to enhance the precision of block collections. Most of the relevant techniques involve a functionality that operates at the block level, based on coarse-grained characteristics of the input block collection, such as the size of blocks: Block Purging [6] a priori discards oversized blocks like b8 in Figure 1(b), while Block Pruning [6] orders blocks from smallest to largest and terminates their processing as soon as the cost of identifying new duplicates exceeds a predefined threshold. Such techniques are

December 1, 2016. This is the accepted manuscript of the article, which has been published in Information Systems 2017, 65, 137-157. http://dx.doi.org/10.1016/j.is.2016.12.001
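The accounting of redundant comparisons above can be reproduced mechanically: a comparison is redundant if the same entity pair has already been seen in a previously processed block. The sketch below uses a small hypothetical block collection of our own (not the blocks of Figure 1(b)) and assumes blocks are scanned in a fixed order:

```python
from itertools import combinations

def count_comparisons(blocks):
    """Return (total, redundant) comparison counts for an
    overlapping block collection; a comparison is redundant if
    the same pair already co-occurred in an earlier block."""
    seen = set()
    total = redundant = 0
    for block in blocks:
        for pair in combinations(sorted(block), 2):
            total += 1
            if pair in seen:
                redundant += 1
            else:
                seen.add(pair)
    return total, redundant

# hypothetical overlapping blocks: e1-e3 co-occurs twice
blocks = [{"e1", "e3"}, {"e2", "e4"}, {"e1", "e3", "e5"}]
total, redundant = count_comparisons(blocks)
# 5 comparisons in total; the repeated pair e1-e3 is the one
# redundant comparison
```

Detecting superfluous comparisons, by contrast, requires knowing the ground-truth matches, which is why they can only be estimated, not skipped outright, by schemes like the block-level techniques discussed above.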