An Unsupervised Algorithm for Learning Blocking Schemes

Mayank Kejriwal
University of Texas at Austin
kejriwal@cs.utexas.edu

Daniel P. Miranker
University of Texas at Austin
miranker@cs.utexas.edu

Abstract: A pairwise comparison of data objects is a requisite step in many data mining applications, but has quadratic complexity. In applications such as record linkage, blocking methods may be applied to reduce the cost. That is, the data is first partitioned into a set of blocks, and pairwise comparisons are computed only for pairs within each block. To date, blocking methods have required either that the blocking scheme be given or that training data be provided so that a supervised learning algorithm can determine a blocking scheme. In either case, a domain expert is required. This paper develops an unsupervised method for learning a blocking scheme for tabular data sets. The method is divided into two phases. First, a weakly labeled training set is generated automatically in time linear in the number of records of the entire dataset. The second phase casts blocking key discovery as a Fisher feature selection problem. The approach is compared to a state-of-the-art supervised blocking key discovery algorithm on three real-world databases and achieves favorable results.

Index Terms: Blocking, Record Linkage

I. INTRODUCTION

Record Linkage, or the identification of entities within a database that are coreferent, is a long-standing problem with no fewer than eight separate terms referring to the same problem [1]. Wikipedia¹ lists at least fifteen different names, and despite much research the problem does not have an automated solution. Ad hoc and domain-dependent solutions are still common, with human intervention required.

Record Linkage typically requires two primary steps [1]. The first step is referred to as blocking.
Blocking methods mitigate full pairwise comparisons by selecting a small subset of pairs from the database that are considered to be good candidates for pairwise comparison, while discarding the vast majority of pairs that are clearly non-coreferent. Without blocking, each entity must be compared with every other entity to determine whether the two corefer. The number of comparisons in this naive approach grows quadratically with the size of the input, which is impractical for large databases; hence the need for blocking. The blocking step is comprehensively surveyed by Christen [2]. The pairs generated by blocking are then used as input for a second step, which typically involves machine-learning techniques, among others, to isolate duplicates according to some similarity measure. The second step is comprehensively surveyed by Elmagarmid et al. [1].

The blocking phase of this two-step procedure has thus far required a human in the loop. This is because blocking methods require a blocking scheme, a function assumed to be provided by a domain expert [2]. Although multiple methods have investigated the use of a given scheme in a variety of ways [2], there has been negligible research on learning the scheme itself. Two papers sought to address this gap by learning schemes given training data [3], [4]. However, labeling duplicates in large databases is troublesome, particularly if duplicates are sparse or the data is confidential. As the size and diversity of datasets continue to grow in the current era of Big Data, the need for an automated procedure is pressing.

¹ https://en.wikipedia.org/wiki/Record_linkage

In this paper, an unsupervised method is presented for learning blocking schemes. The algorithm runs in two separate phases. In the first phase, the algorithm efficiently generates a weakly labeled training set. In the second phase, the problem of learning blocking schemes from this weakly labeled set is cast as a feature selection problem.
The validity of both phases of the algorithm is demonstrated on three real-world datasets.

The outline of this paper is as follows. Section II lays out the formalism of blocking schemes and describes the blocking step in detail. Section III proposes an algorithm to generate a weakly labeled training set, and presents a worst-case analysis. Section IV describes the feature selection procedures that take the weakly labeled set as input and output a blocking scheme. Section V describes the experiments conducted, including methodology and datasets. Section VI presents the results and a discussion. Section VII presents related work. Finally, Section VIII details future work and concludes the paper.

II. BLOCKING SCHEMES

This section introduces the formalism used in the rest of the paper, along with illustrative examples. For consistency, the terminology proposed by Bilenko et al. [4] is used, although many of the terms below were not formally defined in that work.

A. Definitions and Examples

The most basic elements of a blocking scheme are indexing functions $h_i(x_t)$ [4]. An indexing function accepts a field value from a tuple as input and returns a set $Y$ that contains zero or more blocking key values (BKVs). A BKV identifies a block in which the tuple is placed. Intuitively, one may think of a block as a hash bucket, but more often the function is used to sort the records [5]. If the set $Y$ contains multiple BKVs, then the tuple is assigned to multiple blocks.
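
The definition of an indexing function can be made concrete with a small sketch. The Python code below is illustrative only and is not taken from the paper: the function names (tokens, first_char, build_blocks, candidate_pairs), the field name "name", and the toy records are all assumptions chosen for the example. It takes the hash-bucket view of a block, in which each BKV names a bucket and a tuple that yields several BKVs is placed in several buckets.

```python
from collections import defaultdict
from itertools import combinations


def tokens(field_value):
    """Hypothetical indexing function: one BKV per lowercase token."""
    return set(field_value.lower().split())


def first_char(field_value):
    """Hypothetical indexing function: one BKV, or none for an empty value."""
    stripped = field_value.strip().lower()
    return {stripped[0]} if stripped else set()


def build_blocks(records, field, indexing_function):
    """Place each record id in the block named by every BKV it yields."""
    blocks = defaultdict(set)
    for record_id, record in records.items():
        for bkv in indexing_function(record[field]):
            blocks[bkv].add(record_id)
    return dict(blocks)


def candidate_pairs(blocks):
    """Collect the pairs that share at least one block."""
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Toy data (assumed for illustration); record 4 has an empty field value.
records = {
    1: {"name": "John Smith"},
    2: {"name": "Jon Smith"},
    3: {"name": "Mary Jones"},
    4: {"name": ""},
}

token_blocks = build_blocks(records, "name", tokens)
char_blocks = build_blocks(records, "name", first_char)
pairs = candidate_pairs(token_blocks)
```

In this toy run, record 1 yields two BKVs under tokens and is therefore placed in two blocks, while record 4 yields an empty set and is placed in none. Only records 1 and 2 share a block, so a single candidate pair is retained out of the six possible pairs among four records.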