Big Vehicular Traffic Data Mining: Towards
Accident and Congestion Prevention
Hamzah Al Najada, Imad Mahgoub
Computer & Electrical Engineering & Computer Science Department
Florida Atlantic University, 777 Glades Road, Boca Raton, FL 33431 USA
Email: halnajada2014@fau.edu, mahgoubi@fau.edu
Abstract—In 2013, 32,719 people died in traffic crashes in the
USA. Almost 90 people on average lose their lives every day
and more than 250 are injured every hour. Road safety could
be enhanced by reducing traffic crashes. Traffic crashes
also cause traffic congestion, which has become unbearable,
especially in mega-cities. In addition, the direct and indirect losses from
traffic congestion alone exceed $124 billion. The existence of
Big Data on traffic crashes, together with the availability of Big
Data analytics tools, can help us gain useful insights to enhance
road safety and decrease traffic crashes. In this paper we use
H2O and WEKA mining tools. We apply feature selection
techniques to find the most important predictors. In addition,
we tackle the problem of class imbalance by employing bagging
and using different quality measures. Furthermore, we evaluate
the performance of five classifiers to: (1) conduct Big Data
analysis on a big traffic accidents dataset of 146322 examples, find
useful insights and patterns in the data, and forecast possible
accidents in advance; and (2) conduct Big Data analysis on a big
vehicular casualties dataset of 194477 examples to study
drivers’ behavior on the road. From this driver behavior mining
we can predict the driver’s age and sex as well as the accident severity.
These analyses can be used by decision makers and
practitioners to develop new traffic rules and policies in order
to prevent accidents and increase roadway safety.
Index Terms—Intelligent Transportation System (ITS), Vehic-
ular Ad-hoc Network (VANET), Traffic Engineering, Big Data,
Machine Learning and Data Mining.
I. INTRODUCTION
Big Data is one of the most commonly used buzzwords
nowadays. Its popularity stems from the voluminous growth
of domain-specific structured and unstructured information
in both public and private organizations. This massive
amount of data, which has high volume, variety, velocity, and
complexity, can be used to extract useful information about
problems. Moreover, Big Data matters because it is generated
by everything around us, such as the web, smart phones, computer networks,
social networks, vehicular networks, transportation, and much
more. The International Data Corporation (IDC) formally defines
Big Data as “A new generation of technologies and
architectures designed to economically extract value from very
large volume of a wide variety of data, by enabling high-
velocity capture, discovery and/or analysis” [1]. Leveraging
big traffic data provides a strong foundation for developing
and enhancing Intelligent Transportation Systems (ITSs). The
analysis of large-scale transportation and accident data
has great potential, and it can yield very useful insights from
the hidden relations in the data. The contributions of this
work are the following: (1) we mine and analyze a large volume of accident data
to test the robustness and effectiveness of the analysis; (2) we select the most
relevant attributes to reduce the processing time and increase
the prediction accuracy; (3) we select the classification
algorithms that combine the lowest processing time with the
highest accuracy; (4) we tackle the problem of imbalanced class
distribution in both the accidents and casualties datasets; and (5) we
mine and analyze the drivers’ data, since drivers’ behavior
has a very strong influence on the roadways.
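To illustrate why contribution (4) calls for quality measures beyond plain accuracy, the following minimal Python sketch compares accuracy with per-class precision, recall, and F-measure on a toy imbalanced label set. The class labels, the 95/5 split, the two example predictors, and the choice of precision, recall, and F-measure are illustrative assumptions; they are not the data or the specific measures reported in this paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred, positive):
    """Return (precision, recall, F-measure) for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical toy labels (NOT taken from the paper's datasets):
# 95 majority-class examples and 5 minority-class examples.
y_true = ["slight"] * 95 + ["fatal"] * 5

# Predictor A: always predicts the majority class.
y_majority = ["slight"] * 100

# Predictor B: finds 4 of the 5 minority examples at the cost of 10 false alarms.
y_better = ["slight"] * 85 + ["fatal"] * 10 + ["fatal"] * 4 + ["slight"] * 1

acc_majority = accuracy(y_true, y_majority)
prec_majority, rec_majority, f_majority = per_class_metrics(y_true, y_majority, "fatal")

acc_better = accuracy(y_true, y_better)
prec_better, rec_better, f_better = per_class_metrics(y_true, y_better, "fatal")

print(f"majority: acc={acc_majority:.2f} recall={rec_majority:.2f} F={f_majority:.2f}")
print(f"better:   acc={acc_better:.2f} recall={rec_better:.2f} F={f_better:.2f}")
```

In this toy setting the majority-class predictor reaches 95% accuracy yet never detects the minority class, whereas the second predictor gives up some accuracy for a minority-class recall of 0.8. This is the kind of effect that accuracy alone hides on imbalanced data.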
In this work, we study two big benchmark datasets: the
Accidents dataset, which consists of 146322 examples, and the
Casualties dataset, which consists of 194477 examples. We
selected these two datasets because they represent real-world
data. We trust these data because they come from
actual accidents on public roads that were reported to the
police and subsequently recorded using the STATS19 accident
reporting form [2]. We use the
first dataset to find the main causes of traffic accidents,
which in turn cause traffic casualties and congestion; we
aim to prevent both. The Casualties dataset is used to
study the effect of human behavior on traffic accidents.
Human actions in vehicles or on roads can cause significant
side effects. Human behavior should therefore be
studied thoroughly, because it has a significant impact on
traffic and roadways.
The remainder of this paper is organized as follows. Section II
presents the related work. Section III introduces challenges in
the proposed framework. Section IV presents the framework
for our experiments, including the datasets, data processing,
and the feature selection techniques. Section V presents our
experimental results and analysis. Finally, conclusions and
future work are presented in Section VI.
II. RELATED WORK
A perspective analysis of traffic accidents was performed
by Krishnaveni and Hemalatha [3] using Big Data
techniques. They used accident data from Hong Kong’s Transportation
Department for the year 2008, with a total
of 34000 records. To identify the causes of accidents, a
number of classification algorithms were used and their
performance was compared in WEKA. The results showed that
Random Forest outperformed J48, NB, PART and AdaBoostM1
978-1-5090-0304-4/16/$31.00 ©2016 IEEE 256