This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS

Estimating Inefficiency in Bus Trip Choices from a User Perspective With Schedule, Positioning, and Ticketing Data

Tarciso Braz, Matheus Maciel, Demetrio Gomes Mestre, Nazareno Andrade, Carlos Eduardo Pires, Andreza Raquel Queiroz, and Veruska Borges Santos

Manuscript received December 14, 2017; revised March 30, 2018; accepted May 15, 2018. This work was supported by EUBra-BIGSEA, a Research and Innovation Action, funded in part by the European Commission through the Cooperation Programme, Horizon 2020, under Grant 690116, and in part by the Ministério de Ciência, Tecnologia e Inovação, RNP/Brazil, under Grant GA-0000000650/04. The Associate Editor for this paper was R. Nair. (Corresponding author: Tarciso Braz.)

The authors are with the Systems and Computing Department, Universidade Federal de Campina Grande, Campina Grande 58429-900, Brazil (e-mail: tarcisocomp@gmail.com; teu.araujo@gmail.com; demetriogm@gmail.com; nazareno@computacao.ufcg.edu.br; cesp@computacao.ufcg.edu.br; andreza.queiroz@ccc.ufcg.edu.br; veruska.santos@ccc.ufcg.edu.br).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TITS.2018.2846036

Abstract—The availability of historical data on the Global Positioning System (GPS) trajectories of vehicles and on passenger boarding for the public bus fleets of large municipalities has given researchers and practitioners the opportunity to explore new challenges in the analysis of public transportation systems. This paper performs one such analysis as a case study, examining the margin of improvement that passengers in a Brazilian city of 1.8M people have when choosing their daily bus trips. In doing so, we document a number of not readily apparent challenges that must be overcome to make public transportation big data useful to policymakers, transportation system operators, and citizens. Solutions are devised for each of these challenges and demonstrated in the analysis of the aforementioned city.

Index Terms—Public transportation, transit usage performance improvement, map-matching, origin-destination estimation.

I. INTRODUCTION

INTELLIGENT transportation systems, and in particular Traveler Information Systems, have the potential to optimize transit trips according to user preferences and restrictions. Indeed, a number of such systems have been proposed and are used daily by millions of transport users worldwide, such as Google Maps (https://www.google.com/maps) and Moovit (https://www.moovitapp.com/). Nevertheless, although there has been a constant push to improve algorithms that predict trip time or comfort, there has been comparatively little effort to estimate the current margin for improvement that such algorithms can attain at scale and in naturalistic settings.

It is possible that present systems are already close to a performance ceiling given the actual choices available in a city. In other words, it is possible that transit users typically choose their optimal trips in their routine. If this is the case, there may be more efficient uses of research and development effort than trying to improve the effectiveness of Traveler Information Systems. If the contrary is true, it would be useful for operators and the community to have user-centric information (e.g. [1]) in order to understand to what degree different types of routes, times, or users exhibit inefficiency in the trips taken by passengers as part of their daily travel behavior.
In this context, the present work helps fill two gaps in the literature. First, it performs a citywide analysis of the efficiency of the choices made by bus users in the transit system of Curitiba, a city of 1.8M people in Brazil. Efficiency is measured as how close the choices made by transit users are to the optimal choice available for their trip with respect to trip duration. This analysis leverages historical data from the entire bus system, integrated with ticketing and schedule data.

The second contribution of this work is to document and address the difficulties of integrating and leveraging historical transport data to perform such an analysis. Despite recent advances in the availability of transport data and in the formats for sharing it between transportation companies and the government or citizens, the formats and inconsistencies presently prevalent in historical transport data pose a number of challenges for (i) examining a transport system at trip level using vehicle location data, (ii) estimating boarding position when automatic fare collection data is available, and (iii) inferring user trip destination from the combination of vehicle location and ticketing data. This work documents and discusses these challenges as observed in data from multiple cities, puts forward open solutions to them, and evaluates these solutions.

The remainder of this paper is organized as follows. Section II describes the three data sources used in this work and the challenges usually present in their integration. Section III details the Curitiba bus system and the data collected for this work. The solutions to the data integration challenges are detailed in Sections IV, V, and VI. Our experiment to quantify inefficiency in user trip choice and its results are presented in Section VII. Section VIII reviews previous work. Finally, conclusions and future directions are discussed in Section IX.
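The duration-based notion of efficiency introduced above, namely how close a chosen trip is to the fastest alternative available for the same trip, can be illustrated with a minimal sketch. This is not the procedure used in the paper, which is detailed in later sections. The `TripOption` record, the `duration_inefficiency` function, the route labels, and the choice to report both an absolute and a relative excess are all assumptions made here for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TripOption:
    """One way of making an origin-destination trip (hypothetical record)."""
    route: str
    duration_min: float  # trip duration in minutes


def duration_inefficiency(chosen: TripOption,
                          alternatives: Sequence[TripOption]) -> dict:
    """Compare the chosen trip with the fastest option available.

    Assumption: the chosen trip is itself a feasible option, so the best
    duration can never exceed the chosen duration.
    """
    if chosen.duration_min <= 0:
        raise ValueError("chosen trip must have a positive duration")
    # The chosen trip is listed first, so it wins any tie for the minimum.
    best = min([chosen, *alternatives], key=lambda o: o.duration_min)
    excess = chosen.duration_min - best.duration_min
    return {
        "best_route": best.route,
        "excess_min": excess,                       # minutes lost
        "relative_excess": excess / chosen.duration_min,  # share of trip time
    }


# Hypothetical example: the passenger took route "A" although "B" was faster.
chosen = TripOption(route="A", duration_min=40.0)
alternatives = [TripOption("B", 30.0), TripOption("C", 45.0)]
result = duration_inefficiency(chosen, alternatives)
print(result)
```

Under these assumptions, a trip with zero excess is one where the passenger already picked the fastest available option; aggregating the excess over many trips would give one possible citywide view of the margin for improvement.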
1524-9050 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.