Exploratory Trajectory Analysis for Massive Historical AIS Datasets

Anita Graser*†, Melitta Dragaschnig*, Peter Widhalm*, Hannes Koller* and Norbert Brändle*
* AIT Austrian Institute of Technology, Vienna, Austria
† University of Salzburg, Salzburg, Austria
Email: anita.graser@ait.ac.at
ORCID: AG: 0000-0001-5361-2885, MD: 0000-0001-5100-2717, PW: 0000-0002-5074-5356, HK: 0000-0002-4255-3530, NB: 0000-0002-2976-3138

Abstract: Data exploration is an essential task for gaining an understanding of the potential and limitations of novel datasets. This paper discusses the challenges related to exploring large Automatic Identification System (AIS) datasets. We address these challenges using trajectory-based analysis approaches implemented in distributed computing environments using Spark and GeoMesa. This approach enables the exploration of datasets that are too big to handle within conventional spatial database systems. We demonstrate our approach using a case study of 4 billion AIS records.

Index Terms: exploratory data analysis, mobility data, movement data, travel time, spatiotemporal

I. INTRODUCTION

Massive ship movement datasets collected from the Automatic Identification System (AIS) have the potential to improve maritime safety and efficiency of operations. Big AIS datasets can serve as input for machine learning approaches, for example, to extract ship routes and predict travel times [1] or future destinations [2]. The performance and therefore success of data-driven approaches depends strongly on the quality of the input data, that is, the suitability of the data for a certain purpose. “Garbage in, garbage out” is a well-known concept in computer science and mathematics. Therefore, it is necessary to evaluate data suitability.

Exploratory data analysis (EDA) [3] analyzes datasets to determine what information the data contains. This is commonly achieved by summarizing the dataset’s main characteristics, often using visualizations.
EDA goals are to assess assumptions and suggest hypotheses, to select statistical tools, and to provide a basis for further data collection if required. EDA concepts for movement data have been covered extensively by [4].

A. Problem statement

The majority of existing movement data analysis methods cannot deal with large datasets since, in the past, movement data analysis had to deal with limited data availability. Therefore, traditional approaches quickly reach their limits as dataset size increases. Since the limits of existing tools for storing, processing, and visualizing movement data vary, there is no single clear definition of the term “massive” in the context of movement data analysis. However, the common denominator is that big or massive datasets cannot be handled by conventional tools on a single machine.

This work was supported by the Austrian Federal Ministry for Transport, Innovation and Technology (BMVIT) within the programme “IKT der Zukunft” under Grant 861258 (project MARNG).

To get a sense of where the limits of conventional tools may lie, we refer to the literature. For example, [5] report a processing time of one day to create vessel tracks and density maps of 60 thousand AIS records (one month of AIS information from around Shetland) using ArcGIS. For a global ship density grid using 1.5 billion AIS records, [6] report a processing time of “about a week” using PostGIS and GDAL. These lengthy processing times limit the extent of data exploration that analysts can perform within a given time frame.

Sampling is a common approach to reduce datasets to a size that can still be handled. However, it is hard to extract useful data samples for movement exploration tasks. To explore a dataset and assess preliminary assumptions about the data, it is necessary to be able to look at the whole dataset since sampling can reinforce assumptions.
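One way to see why record-level sampling is problematic for movement data is to look at what it does to the time gaps within each vessel's track. The following is a minimal sketch on synthetic data, not part of the case study: the fixed 10-second reporting interval, the 1% sampling rate, and the helper names `make_records` and `median_gap` are assumptions chosen purely for illustration.

```python
import random
from statistics import median


def make_records(n_vessels=5, n_reports=2000, interval_s=10):
    """Synthetic (vessel_id, timestamp_s) records with a fixed reporting interval.

    Hypothetical stand-in for AIS position reports; real AIS intervals vary.
    """
    return [(v, t * interval_s) for v in range(n_vessels) for t in range(n_reports)]


def median_gap(records):
    """Median gap in seconds between consecutive records of the same vessel."""
    by_vessel = {}
    for vessel_id, ts in records:
        by_vessel.setdefault(vessel_id, []).append(ts)
    gaps = []
    for timestamps in by_vessel.values():
        timestamps.sort()
        gaps.extend(b - a for a, b in zip(timestamps, timestamps[1:]))
    return median(gaps)


random.seed(42)  # fixed seed so the sketch is reproducible
records = make_records()
sample = random.sample(records, k=len(records) // 100)  # keep 1% of all records

full_gap = median_gap(records)
sampled_gap = median_gap(sample)
print("median gap, full dataset:", full_gap, "s")
print("median gap, 1% sample:   ", sampled_gap, "s")
```

In this sketch, the sampled tracks show much longer gaps between consecutive positions than the full tracks, so any statistic derived from consecutive records (speed, travel time, gap detection) would describe the sampling procedure rather than the vessels. Sampling whole vessels or whole trips avoids this particular effect but raises the question of which vessels or trips are representative.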
While the above-mentioned challenges concern all movement datasets, maritime movement data presents additional challenges. Maritime vessel trip properties vary significantly, for example, regarding duration (from minutes to weeks) and spatial extent (from local to global). Furthermore, AIS data sources usually do not provide complete global coverage without gaps but are limited to certain regions. Therefore, vessel AIS tracks may contain long observation gaps. Additionally, reported information on movement speed and direction, trip destination [5], vessel identity [6], time stamps [7] and more is not always reliable.

To address these challenges, it is necessary to develop distributed computing approaches for maritime movement data. Developments in this direction include, for example, [8]–[11].

B. Contribution

There is a lack of established EDA tools as well as a lack of literature on best practices for applying EDA to movement data in general [12] and AIS data in particular.

To address the current lack of best practices, this paper proposes concepts for the systematic exploration of large AIS datasets. We demonstrate these concepts using a case study of a dataset with 4 billion records. We cover analyses ranging from raw AIS records (II-A), to continuous vessel tracks (II-B) and, finally