Big Vehicular Traffic Data Mining: Towards Accident and Congestion Prevention

Hamzah Al Najada, Imad Mahgoub
Computer & Electrical Engineering & Computer Science Department
Florida Atlantic University, 777 Glades Road, Boca Raton, FL 33431 USA
Email: halnajada2014@fau.edu, mahgoubi@fau.edu

Abstract—In 2013, 32,719 people died in traffic crashes in the USA. On average, almost 90 people lose their lives every day and more than 250 are injured every hour. Reducing traffic crashes would enhance road safety. Traffic crashes also cause traffic congestion, which has become unbearable, especially in mega-cities. In addition, the direct and indirect losses from traffic congestion alone exceed $124 billion. The existence of Big Data on traffic crashes, together with the availability of Big Data analytics tools, can help us gain useful insights to enhance road safety and decrease traffic crashes. In this paper we use the H2O and WEKA mining tools. We apply feature selection techniques to find the most important predictors. In addition, we tackle the problem of class imbalance by employing bagging and using different quality measures. Furthermore, we evaluate the performance of five classifiers to: (1) conduct Big Data analysis on a big traffic accidents dataset of 146322 examples, find useful insights and patterns in the data, and forecast possible accidents in advance; and (2) conduct Big Data analysis on a big vehicular casualties dataset of 194477 examples, to study the driver’s behavior on the road. By mining the driver’s behavior we can predict the driver’s age and sex as well as the accident severity. These analyses can be used by decision makers and practitioners to develop new traffic rules and policies, in order to prevent accidents and increase roadway safety.

Index Terms—Intelligent Transportation System (ITS), Vehicular Ad-hoc Network (VANET), Traffic Engineering, Big Data, Machine Learning and Data Mining.

I. INTRODUCTION

Big Data is one of the most commonly used buzzwords nowadays. Its prominence stems from the voluminous growth of domain-specific structured and unstructured information in both public and private organizations. This massive amount of data, which has high volume, variety, velocity, and complexity, can be used to extract useful information about problems. Moreover, it is important to take this Big Data into account because it is generated by everything around us, such as the web, smart phones, computer networks, social networks, vehicular networks, transportation, and much more. The International Data Corporation (IDC) formally defines Big Data as: "A new generation of technologies and architectures designed to economically extract value from very large volume of a wide variety of data, by enabling high-velocity capture, discovery and/or analysis" [1].

Leveraging big traffic data provides a very good platform to develop and enhance Intelligent Transportation Systems (ITSs). The analysis of large-scale transportation and accident data has great potential and can yield very useful insights from the hidden relations in the data.

The contributions of this work are the following: (1) mining and analyzing a large volume of accident data, to test its robustness and effectiveness; (2) selecting the most relevant attributes, to reduce the processing time and increase the prediction accuracy; (3) selecting the optimum classification algorithms, those that have the lowest processing time with the highest accuracy; (4) tackling the problem of imbalanced class distribution in both the accidents and casualties datasets; and (5) mining and analyzing the drivers’ data, since the drivers’ behavior has a very strong impact and influence on the roadways.

In this work, we study two big workbench datasets: the Accidents dataset, which consists of 146322 examples, and the Casualties dataset, which consists of 194477 examples.
We selected these two datasets because they represent real-world data. We trust these data because they come from actual accidents on public roads that were reported to the police and subsequently recorded using the STATS19 accident reporting form [2]. The first dataset allows us to find the main causes of traffic accidents, which in turn are the main cause of traffic casualties and congestion; we aim to prevent both. The Casualties dataset is used to study the effect of human behavior on causing traffic accidents. Human actions in vehicles or on roads can cause significant side effects. Human behavior should therefore be studied thoroughly, because it has a significant impact on traffic and roadways.

The remainder of this paper is organized as follows. Section II presents the related work. Section III introduces challenges in the proposed framework. Section IV presents the framework for our experiments, including the datasets, data processing, and the feature selection techniques. Section V presents our experimental results and analysis. Finally, conclusions and future work are presented in Section VI.

II. RELATED WORK

A perspective analysis of traffic accidents has been performed by Krishnaveni and Hemalatha [3] using Big Data techniques. They used the Hong Kong Transportation Department’s accident data for the year 2008, with a total of 34000 records. To identify the causes of accidents, a number of classification algorithms were used and their performance was compared in WEKA. The results showed that Random Forest outperformed J48, NB, PART and AdaBoostM1

978-1-5090-0304-4/16/$31.00 ©2016 IEEE 256