Big data analytics for smart mobility: a case study

Roberto Trasarti ^1, Barbara Furletti ^1, Lorenzo Gabrielli ^1, Mirco Nanni ^1, Dino Pedreschi ^1,2

^1 KDD Lab - ISTI - CNR, Pisa, Italy, name.surname@isti.cnr.it
^2 University of Pisa, Pisa, Italy, pedre@di.unipi.it

1. APPLICATION SCENARIO

This paper presents a real case study where several mobility data sources are collected in an urban context, integrated, and analyzed in order to answer a set of key questions about mobility. The study of human mobility is a very sensitive topic for both public transport (PT) companies and local administrations. This work contributes to the understanding of some aspects of mobility in Cosenza, a town in the South of Italy, and to the realization of corresponding services that answer the following questions, identified in collaboration with the PT experts.

Question 1: How is PT able to substitute private mobility? The objective is to compare public and private mobility to verify the capability of PT to satisfy the users' mobility needs.

Question 2: How reachable are the different zones of the city using PT? This question focuses on understanding how well different zones of the city are served by PT at different times of the day.

Question 3: Are there usual time deviations between real travel times and official timetables? We want to verify whether such systematic deviations exist, highlighting chronic delays in the service.

Question 4: Can we spot visitors and commuters by their behavior? We aim at identifying important categories of people and estimating their share of the population, in order to evaluate the corresponding demand for services.

For this case study we use data from the Cosenza area: a GSM dataset ^1, a GPS dataset ^2, and data from the PT system ^3. The GSM data contain 25 million phone calls made by about 350K distinct users from 15 October to 9 November 2012.
The GPS dataset contains about 1.5 million private vehicle tracks gathered in February-March and July-August 2012, while the PT data consist of a set of GPS logs obtained by the on-board tracking system of Cosenza's PT and the official PT timetable containing the scheduled arrival times of the buses at their stops.

^1 Wind Telecom S.p.a http://www.wind.it/
^2 Octotelematics S.p.a. http://www.octotelematics.com/
^3 Amaco S.p.a. http://www.amaco.it/

(c) 2014, Copyright is with the authors. Published in the Workshop Proceedings of the EDBT/ICDT 2014 Joint Conference (March 28, 2014, Athens, Greece) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0

2. METHODOLOGY AND RESULTS

To answer the questions posed by the PT manager we developed and implemented a set of methodologies and processes, and we integrated the corresponding services in M-Atlas [3], a larger mobility data analysis framework developed in our laboratory.

For Question 1 we study the capability of PT to replace private mobility in a city. We use the GPS logs of the buses, an actual timetable computed from the observed bus movements, and the GPS tracks of the private vehicles. We map the PT system to a spatio-temporal network, where nodes are bus stops labeled with name and position, and edges are connections labeled with origin and destination stops and a timestamp. Then, we map the GPS tracks onto the PT network and compute the shortest route that satisfies each user's trip using an agent-based algorithm that simulates human mobility in a network [1]. To evaluate the efficiency of the PT we compute the percentage of trips satisfied by public transport within a temporal and spatial tolerance (Coverage), and the distribution of the delays accumulated by users who take the PT instead of the car (Distribution of time deviations).
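The Coverage measure described above can be illustrated with a minimal sketch. It is not the agent-based algorithm of [1]: as a simplifying assumption, it swaps in a plain earliest-arrival scan over timetable connections sorted by departure time. The names `Connection`, `CarTrip`, `earliest_arrival`, and `coverage` are hypothetical, times are assumed to be minutes since midnight, transfers are assumed to take no time, and each car trip is assumed to be already snapped to the stops within walking tolerance of its endpoints.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Connection:
    """One edge of the spatio-temporal network (hypothetical structure)."""
    origin: str       # stop where the bus departs
    destination: str  # next stop reached by the bus
    depart: int       # departure time, minutes since midnight
    arrive: int       # arrival time, minutes since midnight


@dataclass(frozen=True)
class CarTrip:
    """A private-vehicle trip already mapped to PT stops (assumption)."""
    origin_stop: str
    destination_stop: str
    depart: int  # minutes since midnight
    arrive: int  # minutes since midnight


def earliest_arrival(connections: Iterable[Connection], origin: str,
                     destination: str, depart: int) -> Optional[int]:
    """Earliest arrival time at `destination`, or None if unreachable."""
    best = {origin: depart}  # earliest known arrival time per stop
    # Scanning by departure time guarantees that best[c.origin] is final
    # before any connection leaving c.origin later is examined.
    for c in sorted(connections, key=lambda c: c.depart):
        reachable = c.origin in best and best[c.origin] <= c.depart
        if reachable and c.arrive < best.get(c.destination, float("inf")):
            best[c.destination] = c.arrive
    return best.get(destination)


def coverage(trips: List[CarTrip], connections: List[Connection],
             max_delay: int = 60) -> float:
    """Fraction of car trips that PT can serve with at most `max_delay`
    minutes of extra time with respect to the car arrival."""
    if not trips:
        return 0.0
    covered = 0
    for t in trips:
        pt_arrival = earliest_arrival(connections, t.origin_stop,
                                      t.destination_stop, t.depart)
        if pt_arrival is not None and pt_arrival - t.arrive <= max_delay:
            covered += 1
    return covered / len(trips)
```

The differences `pt_arrival - t.arrive` over the covered trips would give the distribution of time deviations; a fuller version would also add walking legs between trip endpoints and stops instead of assuming pre-snapped trips.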
Using a maximum walking distance of 2 km and a maximum delay of 1 hour as the temporal constraint, we find that 24% of the users' car trips can be made entirely by PT with no more than 1 hour of extra time. If we further investigate the delay of the PT trips w.r.t. the car trips, we find that the delay distribution is affected by seasonality: in summer the average delay is 29 minutes (with a variance of 26), while in winter it is 16 minutes (with a variance of 15).

Going back to the trajectory data and extracting the starting points of the users who are not served by public transport, we can discover which areas are disconnected from the network. By applying a clustering algorithm to the starting points of the GPS tracks that are not fully covered by the PT, we identify two peripheral areas, one industrial and one residential, that are not reached by the bus service (Fig. 1). This result suggests introducing new lines, or adding new bus stops to an existing line passing near those areas. This service is very effective in discovering the real needs of the population and how the network