N-Way Heterogeneous Blocking

Mayank Kejriwal
University of Texas at Austin
kejriwal@cs.utexas.edu

Daniel P. Miranker
University of Texas at Austin
miranker@cs.utexas.edu

Abstract—Record linkage concerns the linkage of records between two tabular datasets. To avoid naive quadratic computation, typical solutions employ a technique called blocking. A blocking scheme partitions records into blocks, and generates a candidate set by pairing records within a block. Current models of blocking have been restricted to two homogeneous datasets. The variety aspect of Big Data motivates heterogeneous record linkage; hence, the blocking of N heterogeneous datasets is a worthy problem. In this paper, we define a framework for N-Way heterogeneous blocking. Our model subsumes the current binary model. Within our model, a blocking scheme is defined on arbitrary numbers of heterogeneous relations, and shown to be a dependency relation. Necessary and sufficient conditions for blocking scheme transitivity are further proved. We use the framework to generalize two popular binary blocking methods, traditional blocking and sorted neighborhood, to N-ary. To avoid worst case quadratic cost in N, the extended sorted neighborhood uses a novel dual windowing scheme. We show that a transitive blocking scheme enables the use of transitive closure algorithms to monotonically improve the candidate set till it is optimal. Thus, extended sorted neighborhood is shown to admit both qualitative and computational benefits.

Index Terms—N-Way heterogeneous Blocking, Record Linkage, Deduplication

I. INTRODUCTION

With the advent of Big Data [1], pairing objects that refer to the same underlying entity has become a pressing issue that has attracted interest from industry and academia alike [2]. However, the number of such pairwise comparisons grows quadratically both in the number and sizes of datasets that need to be linked.
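To make the quadratic growth concrete, the following minimal sketch counts the cross-dataset pairs a brute-force comparison would examine and contrasts this with the candidate set produced by a simple key-based (traditional) blocking pass. It is an illustration only, not the N-Way method developed in this paper: the three toy datasets, their schemas, the per-dataset key functions, and the helper names `brute_force_pair_count` and `traditional_blocking` are all assumptions introduced for the example.

```python
from collections import defaultdict
from itertools import combinations


def brute_force_pair_count(datasets):
    """Count every cross-dataset record pair (no blocking)."""
    sizes = [len(records) for records in datasets]
    # Sum of n_i * n_j over all unordered pairs of datasets.
    return sum(a * b for a, b in combinations(sizes, 2))


def traditional_blocking(datasets, key_funcs):
    """Toy key-based blocking: records sharing a key fall in one block.

    Each dataset has its own (hypothetical) key function, because the
    schemas differ. Only records from different datasets are paired.
    """
    blocks = defaultdict(list)
    for ds_idx, (records, key) in enumerate(zip(datasets, key_funcs)):
        for rec_idx, rec in enumerate(records):
            blocks[key(rec)].append((ds_idx, rec_idx))

    candidates = set()
    for members in blocks.values():
        for a, b in combinations(members, 2):
            if a[0] != b[0]:  # skip pairs drawn from the same dataset
                candidates.add((a, b))
    return candidates


# Three hypothetical datasets with heterogeneous schemas.
people = [{"name": "Ann Lee"}, {"name": "Bob Ray"}]
staff = [{"first": "Ann", "last": "Lee"}, {"first": "Carl", "last": "Lowe"}]
authors = [{"author": "Lee, Ann"}, {"author": "Ray, Bob"}]

datasets = [people, staff, authors]
key_funcs = [
    lambda r: r["name"].split()[-1].lower(),       # last token of full name
    lambda r: r["last"].lower(),                   # explicit last-name field
    lambda r: r["author"].split(",")[0].lower(),   # "Last, First" format
]

candidates = traditional_blocking(datasets, key_funcs)
print("brute-force pairs:", brute_force_pair_count(datasets))
print("candidate pairs after blocking:", len(candidates))
```

Records are identified here by `(dataset index, record index)` tuples so that the candidate set stays independent of each dataset's schema; the choice of last name as the blocking key is arbitrary and would be replaced by whatever scheme suits the data.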
Blocking methods address this brute-force cost by efficiently selecting a small subset of pairs that are considered to be good candidates for subsequent comparison, while discarding the vast majority that are clearly non-coreferent [3]. This subset of pairs, referred to as a candidate set, is then input to a second (usually machine learning) technique to identify true matches. The full process, record linkage, has received widespread research attention¹ and is comprehensively surveyed by Elmagarmid et al. [2].

¹ Different communities refer to the same problem by different names, e.g., entity resolution and co-reference resolution.

A blocking method takes a blocking scheme and a set of tabular datasets as input, and partitions the datasets into blocks using the given scheme. Records within each block are paired in an algorithm-dependent manner and added to the candidate set, which is initially empty.

Despite decades of research, the current model of blocking makes two restrictive assumptions. The first is that at most two tabular datasets are provided, while the second assumes the two datasets to be homogeneous, that is, they are assumed to contain identical attributes in their schemas. The variety facet of Big Data renders these assumptions problematic.

Disregarding heterogeneity, current 2-Way blocking methods only mitigate the quadratic cost of pairwise record comparisons given two relations. A straightforward decomposition of N relations into binary problems still incurs quadratic cost: $O(N^2)$ times the cost of conducting blocking on each pair of relations. N-Way blocking proposes to mitigate this by considering holistic blocking on all relations. The model that we propose also relaxes the assumption of homogeneity. Thus, it addresses a broader class of problems more suited to the needs of the Big Data era.

We note that there is a precedent for similar extensions in other research areas.
For example, holistic techniques have been considered for multi-class classification using Support Vector Machines, instead of decomposing the problem into $O(N^2)$ binary classifications [4]. In data integration, holistic schema mapping was proposed recently [5]. These precedents motivated us to propose a framework for N-Way heterogeneous blocking. To the best of our knowledge, this is the first work to do so.

Within the framework proposed in this paper, the 2-Way homogeneous case is subsumed as a special instance. Thus, previous work on the specialized problem is compatible with the proposed model. We prove some key properties of the blocking schemes defined in the model. Specifically, every blocking scheme in the framework is proved to be a dependency relation, that is, reflexive and symmetric. Transitivity is shown to be guaranteed only if the set of blocks generated by the scheme satisfies some graph partitioning properties.

Furthermore, extensions to popular blocking methods, traditional blocking and sorted neighborhood, are presented. A key sub-problem of sorted neighborhood, c-optimal ordering, is shown to be NP-hard. Thus far in the sorted neighborhood literature, this sub-problem has not been addressed. A motivating example also shows that a single windowing scheme, as used in the original sorted neighborhood, is quadratic in N. To address this, the proposed extended sorted neighborhood employs a dual windowing scheme. It generates a candidate set linear in the total number of records placed within the blocks. In the MapReduce framework, a cost linear in the size of the largest block is guaranteed.

A key proof shows that if the blocking scheme is transitive, then a candidate set can be monotonically improved by performing transitive closure on a graph abstraction of the problem. If the transitive closure