Knowledge-Based Systems 000 (2017) 1–14
Ensemble correlation-based low-rank matrix completion with
applications to traffic data imputation
Xiaobo Chen (a,∗), Zhongjie Wei (b), Zuoyong Li (c), Jun Liang (a), Yingfeng Cai (a), Bob Zhang (d)
(a) Automotive Engineering Research Institute, Jiangsu University, Zhenjiang 212013, China
(b) School of Automotive and Traffic Engineering, Jiangsu University, Zhenjiang 212013, China
(c) Fujian Provincial Key Laboratory of Information Processing and Intelligent Control, Minjiang University, Fuzhou 350108, China
(d) Department of Computer and Information Science, University of Macau, Macau, China
Article info
Article history:
Received 24 August 2016
Revised 4 June 2017
Accepted 6 June 2017
Available online xxx
Keywords:
Missing data
Low-rank matrix completion
Nearest neighbor
Pearson’s correlation
Ensemble learning
Abstract
Low-rank matrix completion (LRMC) is a recently emerging technique that has achieved promising
performance in many real-world applications, such as traffic data imputation. To estimate missing
values, current LRMC-based methods optimize the rank of the matrix comprising the whole traffic
dataset, implicitly assuming that all traffic data are equally important. As a result, they emphasize
the commonality of traffic data while ignoring subtle but crucial differences that arise from the
different locations of loop detectors and the different sampling dates. To handle this problem and
further improve imputation performance, a novel correlation-based LRMC method is proposed in this
paper. First, LRMC is applied to obtain initial estimates of the missing values. Then, a distance
matrix containing the pairwise distances between samples is built from a weighted Pearson’s
correlation that strikes a balance between observed values and imputed values. For a specific sample,
its most similar samples according to this distance matrix are chosen by an adaptive K-nearest
neighbor (KNN) search. LRMC is then applied to these more strongly correlated samples to obtain
refined estimates of the missing values. Finally, we propose a simple but effective ensemble learning
strategy that integrates multiple imputed values for a specific sample to further improve imputation
performance. Extensive numerical experiments are performed on both traffic flow volume data and
standard benchmark datasets. The results confirm that the proposed correlation-based LRMC and its
ensemble learning version achieve better imputation performance than competing methods.
© 2017 Elsevier B.V. All rights reserved.
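The pipeline outlined in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and rests on several assumptions that do not come from the text: samples are stored as matrix rows; a generic iterative SVD soft-thresholding routine (`soft_impute`) stands in for the LRMC solver; the weighted Pearson's correlation gives weight 1 to entries observed in both samples and a hypothetical weight `w_imputed` to all other entries; a fixed `k` replaces the adaptive KNN search; and the ensemble step simply averages the estimates a sample receives from every neighborhood it belongs to. All function and parameter names are illustrative.

```python
import numpy as np

def soft_impute(X, mask, tau=1.0, n_iter=100):
    """Stand-in LRMC solver (assumption): iterative SVD soft-thresholding."""
    Z = np.where(mask, X, 0.0)  # start with zeros at the missing entries
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        low_rank = (U * np.maximum(s - tau, 0.0)) @ Vt  # shrink singular values
        Z = np.where(mask, X, low_rank)  # keep observed entries, refill the rest
    return Z

def weighted_pearson(x, y, w):
    """Pearson's correlation of x and y with non-negative entry weights w."""
    w = w / w.sum()
    mx, my = w @ x, w @ y
    cov = w @ ((x - mx) * (y - my))
    vx, vy = w @ (x - mx) ** 2, w @ (y - my) ** 2
    return cov / np.sqrt(vx * vy + 1e-12)  # small constant avoids division by zero

def correlation_distance(Z, mask, w_imputed=0.5):
    """Pairwise distance 1 - r between rows of the initially imputed matrix Z."""
    n = Z.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            both_observed = mask[i] & mask[j]
            w = np.where(both_observed, 1.0, w_imputed)  # hypothetical weighting
            D[i, j] = D[j, i] = 1.0 - weighted_pearson(Z[i], Z[j], w)
    return D

def ensemble_clrmc(X, mask, k=5, tau=1.0):
    """Initial LRMC, correlation-based KNN, local LRMC, then averaging."""
    Z0 = soft_impute(X, mask, tau)
    D = correlation_distance(Z0, mask)
    total = np.zeros_like(Z0)
    count = np.zeros(X.shape[0])
    for i in range(X.shape[0]):
        order = np.argsort(D[i])
        neighbors = order[order != i][:k]        # fixed k (the paper's search is adaptive)
        idx = np.concatenate(([i], neighbors))   # sample i plus its nearest neighbors
        total[idx] += soft_impute(X[idx], mask[idx], tau)
        count[idx] += 1
    Z = total / count[:, None]                   # average over all neighborhoods
    return np.where(mask, X, Z)                  # observed values stay untouched

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    truth = rng.normal(size=(30, 3)) @ rng.normal(size=(3, 40))  # synthetic rank-3 data
    mask = rng.random(truth.shape) > 0.2                         # True where observed
    X = np.where(mask, truth, np.nan)
    Z = ensemble_clrmc(X, mask, k=5, tau=0.5)
    rmse = np.sqrt(np.mean((Z[~mask] - truth[~mask]) ** 2))
    print("RMSE on missing entries:", rmse)
```

Here `mask` is `True` where a value was observed, and `tau`, `k`, and `w_imputed` are tuning parameters that would need to be selected for real traffic data, for example by validation on artificially removed entries.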
∗ Corresponding author.
E-mail address: xbchen82@gmail.com (X. Chen).
1. Introduction
An Intelligent Transportation System (ITS) is an effective way to
alleviate traffic congestion and improve transportation efficiency.
It is an integrated, comprehensive system that synthesizes a
variety of technologies, including information, computer, data
communication, sensor, and electronic control technologies, as well
as automatic control theory, operations research, and artificial
intelligence. Data are one of the most important factors for an ITS;
the most commonly used parameters include average speed,
real-time vehicle volume, and average lane occupancy rate. By
collecting and analyzing massive amounts of traffic data, an ITS can
manage and predict traffic more effectively. For example, it can (1)
detect traffic anomalies quickly, which makes traffic management more
convenient, and (2) discover inherent rules and knowledge from traffic
data, so as to improve the operational efficiency of traffic
management and road traffic capacity. Based on the traffic flow
predicted a few hours ahead, users can adjust their route plans in
advance to avoid congested roads. Thus, traffic data play a
fundamental role in the construction of an ITS.
In a real traffic environment, the data collected by traffic
equipment, e.g., loop detectors, are usually incomplete: missing
values may occur for a variety of reasons, such as failures of the
loop detectors or of the transmission network. Incomplete traffic
data cannot express traffic information accurately. More importantly,
they prevent the application of many classic data mining algorithms,
such as support vector machines [1–3], neural networks [4], and sparse
learning [5], because these algorithms generally require a complete
set of data. Therefore, the imputation of missing values in a loop
detection system is of great value.
Besides traffic data, missing values also occur frequently in
other real-life processes, such as physical measurements, commercial
surveys, and business reports. In the data mining and machine
learning community, missing value imputation has attracted
http://dx.doi.org/10.1016/j.knosys.2017.06.010
0950-7051/© 2017 Elsevier B.V. All rights reserved.