Exploiting Spatio-Temporal User Behaviors for User Linkage

Wei Chen, School of Computer Science and Technology, Soochow University, China, wchzhg@gmail.com
Hongzhi Yin*, School of ITEE, The University of Queensland, Brisbane, Australia, db.hongzhi@gmail.com
Weiqing Wang, School of ITEE, The University of Queensland, Brisbane, Australia, weiqingwang@uq.edu.au
Lei Zhao, School of Computer Science and Technology, Soochow University, China, zhaol@suda.edu.cn
Wen Hua, School of ITEE, The University of Queensland, Brisbane, Australia, w.hua@uq.edu.au
Xiaofang Zhou, School of ITEE, The University of Queensland, Brisbane, Australia, zxf@itee.uq.edu.au

ABSTRACT
Cross-device and cross-domain user linkage have been attracting a lot of attention recently. An important branch of the study is to achieve user linkage with spatio-temporal data generated by the ubiquitous GPS-enabled devices. The main task in this problem is twofold, i.e., how to extract the representative features of a user; how to measure the similarities between users with the extracted features. To tackle the problem, we propose a novel model STUL (Spatio-Temporal User Linkage) that consists of the following two components. 1) Extract users' spatial features with a density based clustering method, and extract the users' temporal features with the Gaussian Mixture Model. To link user pairs more precisely, we assign different weights to the extracted features, by lightening the common features and highlighting the discriminative features. 2) Propose novel approaches to measure the similarities between users based on the extracted features, and return the pair-wise users with similarity scores higher than a predefined threshold. We have conducted extensive experiments on three real-world datasets, and the results demonstrate the superiority of our proposed STUL over the state-of-the-art methods.
KEYWORDS
Cross-domain; User linkage; Spatio-temporal behaviors

* This author is the corresponding author.
CIKM'17, November 6–10, 2017, Singapore. © 2017 ACM. ISBN 978-1-4503-4918-5/17/11. DOI: http://dx.doi.org/10.1145/3132847.3132898

1 INTRODUCTION
The proliferation of GPS-enabled devices and mobile techniques has led to the emergence of a large amount of spatio-temporal information. For example, vehicles equipped with GPS can generate many trajectories, each consisting of a sequence of points sampled in a short time period, to keep track of moving objects. Meanwhile, the widespread adoption of location-based social networks, such as Facebook, Twitter, and Foursquare, has generated massive discrete check-in data [20], as many users share their status associated with locations and timestamps. The availability of spatio-temporal information offers a good opportunity to model users' spatio-temporal behaviors [23][18]. On the other hand, user linkage, which aims at connecting the same users across different platforms, has attracted much attention. User linkage benefits a wide range of real applications, such as prediction [13][21], data fusion [28], and recommendation [19][22]. This paper focuses on leveraging the increasingly available spatio-temporal information for user linkage.
However, to the best of our knowledge, there is only one work that utilizes users' spatial and temporal features simultaneously to achieve user linkage [14]. In that work, locations and times are divided into bins, and each spatio-temporal record is associated with a bin (r, t), where r is a region and t represents a time interval. The similarities between users are inferred from the users' co-occurrences in each bin. Nonetheless, time and space are intrinsically continuous. Discretization of time and space inevitably leads to information loss, especially for points near the bin boundaries. Assume that u0 is a user on platform A, while u1 and u2 are two users on platform B. To simplify the problem, we assume that each of the users u0, u1 and u2 has only one activity record, v0, v1 and v2 respectively. The distributions of these activity records in terms of space and time are given in Figure 1(a) and 1(b) respectively. Based on [14], u0 and u2 have a larger probability of being linked together, as they co-occur in both the spatial bin r1 and the temporal bin t1. However, compared with u2, u1 is more similar to u0 in terms of both the spatial distribution in Figure 1(a) and the temporal distribution in Figure 1(b). Thus, the discretization-based method cannot capture the similarity between features that are divided into different bins. Besides, discretization of time and space always raises the question of how to select the region or time interval size, and the size is invariably too small for some regions and too large for others.
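The boundary effect described above can be reproduced with a small sketch. The coordinates, the 1 km cell size, the 1 hour slot size, and all names (`Record`, `to_bin`, `co_occur`, `spatial_gap`, `temporal_gap`) are hypothetical choices made for illustration. They are not the values behind Figure 1, and the binning function is a simplified stand-in rather than the method of [14].

```python
import math
from typing import NamedTuple


class Record(NamedTuple):
    """One activity record (hypothetical units: km for x/y, hours for t)."""
    x: float
    y: float
    t: float


def to_bin(rec: Record, cell_km: float = 1.0, slot_h: float = 1.0):
    """Map a record to a (region, time slot) bin using a uniform grid."""
    region = (math.floor(rec.x / cell_km), math.floor(rec.y / cell_km))
    slot = math.floor(rec.t / slot_h)
    return region, slot


def co_occur(a: Record, b: Record) -> bool:
    """True if both records fall into the same spatial and temporal bin."""
    return to_bin(a) == to_bin(b)


def spatial_gap(a: Record, b: Record) -> float:
    """Euclidean distance between two records, in km."""
    return math.hypot(a.x - b.x, a.y - b.y)


def temporal_gap(a: Record, b: Record) -> float:
    """Absolute time difference between two records, in hours."""
    return abs(a.t - b.t)


# Assumed toy records: v0 sits just inside a bin boundary,
# v1 just across it, and v2 at the far side of v0's bin.
v0 = Record(x=0.9, y=0.5, t=0.9)   # user u0 on platform A
v1 = Record(x=1.1, y=0.5, t=1.1)   # user u1 on platform B
v2 = Record(x=0.1, y=0.5, t=0.1)   # user u2 on platform B

for name, v in (("v1", v1), ("v2", v2)):
    print(name,
          "co-occurs with v0:", co_occur(v0, v),
          "| spatial gap:", round(spatial_gap(v0, v), 3),
          "| temporal gap:", round(temporal_gap(v0, v), 3))
```

In this toy setting the bin test pairs v0 with v2, while the continuous gaps in both space and time favor v1. Changing the cell or slot size only moves the boundary to other records, which is the size-selection problem noted above.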