Re-identification of Anonymized CDR Datasets Using Social Network Data

Alket Cecaj, Marco Mamei, Nicola Bicocchi
Università di Modena e Reggio Emilia, Italy
Email: {alket.cecaj,marco.mamei,nicola.bicocchi}@unimore.it

Abstract: In this work we examine a large dataset of 335 million anonymized call records made by 3 million users during 47 days in a region of northern Italy. Combining this dataset with publicly available user data from different social networking services, we present a probabilistic approach to evaluate the potential for re-identification of the anonymized call records dataset. In this sense, our work explores different ways of analyzing data and data fusion techniques to integrate different mobility datasets. On the one hand, this kind of approach can breach users’ privacy despite anonymization, so it is worth studying carefully. On the other hand, combining different datasets is a key enabler for advanced context-awareness, in that information from multiple sources can complement and enrich each other.

I. INTRODUCTION

Mobile devices are making available a vast quantity of data about people's mobility and activity. On the one hand, telecom companies can monitor a large number of mobile terminals as they connect to the network. On the other hand, as Internet connections and smartphones become more affordable, mobile applications and social networks can provide data about their users in a geo-referenced format. In particular, services such as Twitter, Flickr or Foursquare provide data that tell us a lot about people's presence and actions in a given context [17]. By analyzing the geography of these data, it is possible to study human and crowd behavior on a large scale [6], [16]. As mobility is a primary source of context information for several applications, the use of this kind of data is important and can have a strong impact on the fulfillment of the pervasive-anticipatory computing vision [14].
In this domain, an important but problematic activity consists in joining datasets by matching different users associated with the same real person. For example, it would be interesting to realize that user X in a CDR dataset is actually the same person as Twitter user Y, and then join the two datasets. Using mobility and geo-referenced data, this kind of matching process is rather straightforward in principle: it consists in identifying whether CDR user X and Twitter user Y consistently produce data at the same time and place [11]. Once enough geo-referenced elements overlap, we can be reasonably sure that the two users are actually the same person.

On the one hand, this could raise serious privacy issues, as relations between different types of data can be used to infer information of many kinds, from socio-economic status, to mobility and shopping patterns, to the user’s social graph [21]. This is particularly problematic when matching users across data sources makes it possible to bypass the anonymization of a given dataset. As discussed in [21], it may in fact happen that: “The continued accumulation of location data may reach a point where a marketer can uniquely match an anonymous location trace to a named record in a separate database”. On the other hand, and for the same reason, joining different datasets is the key to advanced forms of context awareness that could notably improve pervasive applications and services. In fact, on the basis of such a combined dataset, it would be possible to infer what users were doing in a given location, as well as their general profile.

The contribution of this paper is to conduct analysis and experiments in the above direction. Specifically, we try to answer the following questions. Can we use data from geo-referenced social networks to re-identify mobile users from an anonymized Call Description Records dataset?
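The matching idea described above (checking whether a CDR user and a social network user repeatedly produce events at the same time and place) can be illustrated with a minimal sketch. This is not the procedure used in this paper: the `Event` record, the `cell` identifier used as a spatial unit, the one-hour time bucket, and the function name `count_matches` are all assumptions introduced only for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Event(NamedTuple):
    user: str       # anonymized CDR id or social network handle
    timestamp: int  # seconds since an arbitrary epoch
    cell: str       # hypothetical spatial unit (e.g. an antenna coverage area)


def count_matches(cdr_events, social_events, window_s=3600):
    """Count co-occurrences per (CDR user, social user) pair.

    Two events co-occur when they share the same cell and fall in the
    same time bucket of `window_s` seconds (an assumed discretization).
    """
    # Index CDR users by (cell, time bucket) for constant-time lookup.
    index = defaultdict(set)
    for e in cdr_events:
        index[(e.cell, e.timestamp // window_s)].add(e.user)

    # Each social event adds one match for every CDR user seen
    # in the same cell during the same time bucket.
    matches = defaultdict(int)
    for e in social_events:
        for cdr_user in index.get((e.cell, e.timestamp // window_s), ()):
            matches[(cdr_user, e.user)] += 1
    return dict(matches)


# Toy data: X and Y overlap twice, Z and Y overlap once, W never overlaps.
cdr = [Event("X", 1000, "A"), Event("X", 8000, "B"), Event("Z", 1200, "A")]
social = [Event("Y", 1500, "A"), Event("Y", 8100, "B"), Event("W", 1500, "C")]
result = count_matches(cdr, social)
```

Fixed time buckets keep the sketch short, but two events that are close in time can fall on opposite sides of a bucket boundary and be missed; a sliding window would avoid this at the cost of extra bookkeeping.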
We start answering these questions by using a probabilistic approach that evaluates the probability that users from multiple datasets are actually the same person.

The rest of this article is organized as follows. Section 2 presents state-of-the-art research on entity matching among multiple data sources. Section 3 presents the CDR and social network datasets we used for our analysis. Section 4 presents initial re-identification results based on counting the number of matches among events generated by users across the two datasets. Section 5 presents our probabilistic model to assess whether different users are actually the same person and provides experiments in this direction. Finally, in Section 6, we present our conclusions and future work.

II. RELATED WORK

As large-scale mobility and social network data become progressively available to researchers, there is a considerable body of work on data re-identification as a means to threaten users’ privacy. The vast majority of works deal with the problem from the data uniqueness perspective: what subset of data about a person is enough to make him/her unique, and thus re-identifiable, among all the other users? In [8], for example, the authors analyze census data and discover that disclosing gender, ZIP code and full date of birth allows for the unique identification of 63% of individuals in the US population. Many studies explore the re-identification of datasets, such as movie ratings in the Netflix Prize [12] or Massachusetts hospital medical records, using publicly available side information. Another interesting case is the re-identification of anonymous volunteers in a DNA study for the Personal Genome Project [19]. More in line with our domain, in [11] the authors analyze a large CDR dataset, discovering that 4 CDR events are enough to