Identifying users profiles from mobile calls habits

Barbara Furletti, KDDLAB - ISTI CNR, Pisa, Italy, barbara.furletti@isti.cnr.it
Lorenzo Gabrielli, KDDLAB - ISTI CNR, Pisa, Italy, lorenzo.gabrielli@isti.cnr.it
Salvatore Rinzivillo, KDDLAB - ISTI CNR, Pisa, Italy, salvatore.rinzivillo@isti.cnr.it
Chiara Renso, KDDLAB - ISTI CNR, Pisa, Italy, chiara.renso@isti.cnr.it

ABSTRACT
The huge quantity of positioning data registered by our mobile phones raises several research questions, mainly originating from the combination of this volume of data with the extreme heterogeneity of the tracked users and the low granularity of the data. We propose a methodology to partition the users tracked by GSM phone calls into profiles such as residents, commuters, people in transit and tourists. The methodology analyses the phone calls with a combination of top-down and bottom-up techniques. The top-down phase is based on a sequence of queries that identify some behaviors. The bottom-up phase is a machine learning step that finds groups of similar call behavior, thus refining the previous step. The integration of the two steps partitions the mobile traces into these four user categories, which can then be analyzed in more depth, for example to understand tourist movements in a city or the traffic effects of commuters. An experiment on the identification of user profiles on a real dataset collecting call records from one month in the city of Pisa illustrates the methodology.

Categories and Subject Descriptors
H.2.8 [Database Applications]: Data Mining

General Terms
Algorithms

Keywords
GSM Data, User profiles, SOM

1.
INTRODUCTION

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. UrbComp’12, August 12, 2012. Beijing, China. Copyright 2012 ACM 978-1-4503-1542-5/08/2012 ...$15.00.

4.4 billion users worldwide, 838 GSM networks spread across 234 countries, and 1.44M new GSM subscribers every day: these are only a few of the impressive numbers that witness the enormous diffusion of the GSM phenomenon since its first network was launched at the beginning of the ’90s [7]. This massive number of mobile phones moves every day with their human companions, leaving tracks of their movements. These tracks represent the mobility of millions of people on the Earth's surface, and the opportunities to use these data for analysing and understanding human mobility are tremendous. Research literature has seen a growing interest in techniques for analysing the mobility of users based on GSM position data. This research has also been driven by an increasing number of applications that have found in mobile phone data a good source for discovering interesting results about people's behavior. The advantages of relying on this kind of data, compared to standard survey-based data collection, are that they offer wide coverage of people's presence in an area, they are heterogeneous from the point of view of the tracked person, and they tend to be up to date and easily upgradeable with new automatic data collection. However, this huge quantity of human location data comes at a price. For privacy reasons, the telecommunication provider must anonymize the data.
Thus, the analysis that can be done on such data cannot distinguish between different user profiles. The heterogeneity of these data, besides being a strong point, is therefore also a weak point. The mobility analysis that can be performed may suffer from biases due to the wide differences in the mobility behavior of the tracked users. How can we determine, among all the positions collected in a city, which ones correspond to specific categories of users such as residents or visitors? Is it possible to distinguish them by looking at their mobile usage?

In this paper we address this problem by proposing a methodology to partition a population of users tracked by GSM mobile phones into four predefined user profiles: residents, commuters, people in transit and tourists/visitors. Several applications may benefit from the analysis of a set of users partitioned according to these mobility characteristics. For example, being able to distinguish between residents and commuters may help traffic management to better understand how traffic is affected by the residents' mobility compared to that of the commuters. Identifying the tourists/visitors is essential to study how the city receives people from outside and how their movements affect the city. Again, being able to combine the mobility of the resident population with that of the temporary population (such as commuters, visitors or people in transit) may give a measure of the sustainability of the incoming population with respect to the resident one. The population on a territory consumes resources like water,