arXiv:1807.02262v1 [cs.DB] 6 Jul 2018

Temporal graph-based clustering for historical record linkage
Work-in-progress paper

Charini Nanayakkara
Research School of Computer Science, The Australian National University, Canberra, ACT 2601, Australia
charini.nanayakkara@anu.edu.au

Peter Christen
Research School of Computer Science, The Australian National University, Canberra, ACT 2601, Australia
peter.christen@anu.edu.au

Thilina Ranbaduge
Research School of Computer Science, The Australian National University, Canberra, ACT 2601, Australia
thilina.ranbaduge@anu.edu.au

ABSTRACT
Research in the social sciences is increasingly based on large and complex data collections, where individual data sets from different domains are linked and integrated to allow advanced analytics. Popular types of data used in such a context are historical censuses, as well as birth, death, and marriage certificates. Individually, however, such data sets limit the types of studies that can be conducted. Specifically, it is impossible to track individuals, families, or households over time. Once such data sets are linked and family trees spanning several decades are available, it is possible to, for example, investigate how education, health, mobility, employment, and social status influence each other and the lives of people over two or even more generations. A major challenge, however, is the accurate linkage of historical data sets, which is difficult due to data quality issues and, commonly, the lack of available ground truth data. Unsupervised techniques therefore need to be employed, which can be based on similarity graphs generated by comparing individual records. In this paper we present results from clustering birth records from Scotland, where we aim to identify all births of the same mother and group siblings into clusters.
We extend an existing clustering technique for record linkage by incorporating temporal constraints that must hold between births by the same mother, and propose a novel greedy temporal clustering technique. Experimental results show improvements over non-temporal approaches; however, further work is needed to obtain links of high quality.

CCS CONCEPTS
• Information systems → Entity resolution; Clustering; Temporal data; • Theory of computation → Graph algorithms analysis;

KEYWORDS
Entity resolution, birth records, Scottish, star clustering.

1 INTRODUCTION
Databases that contain personal information, such as censuses or historical civil registries [25], generally contain multiple records describing the same individual (entity) or group of individuals such as families or households, where each individual will occur in such databases with different types of roles [7, 8]. A baby is born, then recorded as a daughter or son in a census, and later she or he might marry (as a bride or groom) and become the mother or father of her or his own children. Being able to link such records across different databases will allow the reconstruction of whole populations and open a multitude of studies in the health and social sciences that currently are not feasible on individual databases [3, 20].

The process of identifying the sets of records that correspond to the same individual is known as record linkage, entity resolution, or data matching [6]. Record linkage involves comparing pairs of records to decide if the records of a pair refer to the same entity (known as a match) or to different entities (a non-match). In such a comparison process, the similarities between the values of a selected set of attributes are generally used to decide whether a pair of records is similar enough to be classified as a match (for example, if the similarities are above a pre-defined threshold value).
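As an illustration of the pair-wise comparison step described above, the following minimal Python sketch classifies a pair of records as a match or a non-match. It is not the method used in this paper: the attribute names, the use of `difflib.SequenceMatcher` as the string similarity function, the averaging of attribute similarities, and the threshold of 0.8 are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def attribute_similarity(a: str, b: str) -> float:
    """Approximate string similarity in [0, 1], ignoring letter case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify_pair(rec_a: dict, rec_b: dict, attributes: list,
                  threshold: float = 0.8) -> str:
    """Classify a record pair by its average attribute similarity.

    The averaging rule and the default threshold are illustrative
    assumptions, not values taken from the paper.
    """
    sims = [attribute_similarity(rec_a[attr], rec_b[attr])
            for attr in attributes]
    average = sum(sims) / len(sims)
    return "match" if average >= threshold else "non-match"


# Hypothetical birth records, described by the mother's name only.
attributes = ["mother_first_name", "mother_surname"]
birth_1 = {"mother_first_name": "Mary", "mother_surname": "MacDonald"}
birth_2 = {"mother_first_name": "Mary", "mother_surname": "McDonald"}
birth_3 = {"mother_first_name": "Agnes", "mother_surname": "Smith"}

print(classify_pair(birth_1, birth_2, attributes))
print(classify_pair(birth_1, birth_3, attributes))
```

With these sample records, the first pair differs only by a spelling variation of the surname and is classified as a match, whereas the second pair shares almost no characters and is classified as a non-match.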
In many application domains, however, this simple pair-wise linkage process does not provide enough information to identify the relationships between different individuals [7, 11].

Recently, in contrast to traditional pair-wise record linkage, group linkage [24] has received significant attention because of its applicability to linking groups of individuals, such as families or households [8, 15]. The identification of relationships between individuals can enrich data and improve the quality of data, and thus facilitate more sophisticated analysis of different socio-economic factors (such as health, wealth, occupation, and social structure) of large populations [13, 16]. Studying these issues is important to identify how societies evolve over time and to discover the changes that influenced and contributed to social evolution [12].

Historical record linkage involves the linkage of historical records, including records from censuses as well as from birth, death, and marriage certificates, to construct longitudinal data sets about a population. Over the past two decades researchers working in different domains have studied the problem of historical record linkage. In 1996 Dillon investigated an approach to link census records from the US and Canada to generate a longitudinal database to examine changes in household structures [10]. The Integrated Public Use Microdata Series (IPUMS, see: https://www.ipums.org/) is a large project initiated by the Minnesota Population Centre (MPC) for linking large demographic data collections. The Life-M project is another example of transforming records from birth, marriage, and death certificates as well as census records into an intergenerational longitudinal database [2]. The project considers US data from the 19th and 20th centuries and aims to use birth certificates as a basis for historical record linkage of large historical databases.
The Digitising Scotland project [9], which this work is a part of, aims to transcribe and link all civil registration events recorded in Scotland between 1856 and 1973. Around 14 million birth, 11 million death, and 4 million marriage records need to be linked