Knowledge-Based Systems 000 (2017) 1–14

Ensemble correlation-based low-rank matrix completion with applications to traffic data imputation

Xiaobo Chen a,∗, Zhongjie Wei b, Zuoyong Li c, Jun Liang a, Yingfeng Cai a, Bob Zhang d

a Automotive Engineering Research Institute, Jiangsu University, Zhenjiang 212013, China
b School of Automotive and Traffic Engineering, Jiangsu University, Zhenjiang 212013, China
c Fujian Provincial Key Laboratory of Information Processing and Intelligent Control, Minjiang University, Fuzhou 350108, China
d Department of Computer and Information Science, University of Macau, Macau, China

Article history: Received 24 August 2016; Revised 4 June 2017; Accepted 6 June 2017; Available online xxx

Keywords: Missing data; Low-rank matrix completion; Nearest neighbor; Pearson’s correlation; Ensemble learning

Abstract

Low-rank matrix completion (LRMC) is a recently emerging technique which has achieved promising performance in many real-world applications, such as traffic data imputation. In order to estimate missing values, current LRMC-based methods optimize the rank of the matrix comprising the whole traffic data, implicitly assuming that all traffic data are equally important. As a result, they put more emphasis on the commonality of traffic data while ignoring the subtle but crucial differences that arise from the different locations of loop detectors and the different sampling dates. To handle this problem and further improve imputation performance, a novel correlation-based LRMC method is proposed in this paper. Firstly, LRMC is applied to get initial estimates of the missing values. Then, a distance matrix containing the pairwise distances between samples is built based on a weighted Pearson’s correlation which strikes a balance between observed values and imputed values.
For a specific sample, its most similar samples according to this distance matrix are chosen by an adaptive K-nearest neighbor (KNN) search. LRMC is then applied to these much more strongly correlated samples to obtain refined estimates of the missing values. Finally, we also propose a simple but effective ensemble learning strategy that integrates multiple imputed values for a specific sample to further improve imputation performance. Extensive numerical experiments are performed on both traffic flow volume data and standard benchmark datasets. The results confirm that the proposed correlation-based LRMC and its ensemble learning version achieve better imputation performance than competing methods.

© 2017 Elsevier B.V. All rights reserved.

∗ Corresponding author. E-mail address: xbchen82@gmail.com (X. Chen).

1. Introduction

An Intelligent Transportation System (ITS) is an effective way to alleviate traffic congestion and improve transportation efficiency. It is an integrated, comprehensive system which synthesizes a variety of technologies, including information, computer, data communication, sensor, electronic control, automatic control theory, operations research, and artificial intelligence. Data is one of the most important factors for an intelligent transportation system, where the most popular parameters include average speed, real-time vehicle volume, average lane occupancy rate, etc. By collecting and analyzing massive amounts of traffic data, ITS can manage and predict traffic more effectively. For example, it can (1) find traffic anomalies quickly and make traffic management convenient, and (2) discover inherent rules and knowledge from traffic data, so as to improve the operational efficiency of traffic management and road traffic capacity. Based on the traffic flow predicted a few hours ahead, users are able to adjust their route plans in advance in order to avoid congested roads. Thus, traffic data plays a fundamental role in the construction of ITS.
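As a rough illustration of the pipeline summarized in the abstract (initial LRMC, correlation-based distances, nearest-neighbor selection, refined LRMC, and ensemble integration), the following Python sketch may be helpful. It is not the implementation used in this paper and rests on several assumptions: `lrmc` is a simple iterative truncated-SVD imputation standing in for a real LRMC solver; the weight `alpha` given to imputed entries in the weighted Pearson's correlation is a hypothetical choice; a fixed `k` replaces the adaptive KNN search; and the ensemble step simply averages every estimate a sample receives across neighbor groups. All function names and the synthetic data are illustrative only.

```python
import numpy as np

def lrmc(X, mask, rank=2, n_iter=100):
    """Stand-in for LRMC (assumption): iterative truncated-SVD imputation."""
    r = min(rank, min(X.shape))
    Z = np.where(mask, X, X[mask].mean())      # start from the observed mean
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        low = (U[:, :r] * s[:r]) @ Vt[:r]      # rank-r approximation
        Z = np.where(mask, X, low)             # keep observed entries fixed
    return Z

def weighted_corr_distance(Z, mask, alpha=0.5):
    """Distance 1 - weighted Pearson correlation between rows (samples).
    Entries observed in both rows get weight 1; entries where either
    value is imputed get weight alpha (hypothetical weighting)."""
    n = Z.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            w = np.where(mask[i] & mask[j], 1.0, alpha)
            di = Z[i] - np.average(Z[i], weights=w)
            dj = Z[j] - np.average(Z[j], weights=w)
            denom = np.sqrt(np.sum(w * di**2) * np.sum(w * dj**2))
            corr = np.sum(w * di * dj) / denom if denom > 0 else 0.0
            D[i, j] = D[j, i] = 1.0 - corr
    return D

def ensemble_corr_lrmc(X, mask, k=5, rank=2, alpha=0.5):
    Z0 = lrmc(X, mask, rank)                   # step 1: initial estimates
    D = weighted_corr_distance(Z0, mask, alpha)  # step 2: distance matrix
    total = np.zeros_like(Z0)
    count = np.zeros(X.shape[0])
    for i in range(X.shape[0]):
        order = np.argsort(D[i])
        neighbors = order[order != i][:k]      # step 3: fixed-k neighbors
        idx = np.concatenate(([i], neighbors))
        Zi = lrmc(X[idx], mask[idx], rank)     # step 4: refined LRMC
        total[idx] += Zi                       # step 5: collect estimates
        count[idx] += 1
    Z = total / count[:, None]                 # average over all groups
    return np.where(mask, X, Z)

# Synthetic low-rank data with about 20% of the entries missing.
rng = np.random.default_rng(0)
truth = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 24))
mask = rng.random(truth.shape) > 0.2           # True where observed
X = np.where(mask, truth, np.nan)
Z = ensemble_corr_lrmc(X, mask)
```

Here each row of `X` plays the role of one sample (for instance, one day of readings from one detector), and `mask` marks the observed entries. Averaging is only one possible way to integrate the multiple imputed values.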
In real traffic environments, the data collected by traffic equipment, e.g., loop detectors, are usually incomplete: many missing values may occur for a variety of reasons, such as failures of the loop detectors or of the transmission network. Incomplete traffic data cannot express traffic information accurately. More importantly, it prevents the application of many classic data mining algorithms, such as support vector machines [1–3], neural networks [4], sparse learning [5], etc., because these algorithms generally require a complete set of data. Therefore, the imputation of missing values in a loop detection system is of great value.

http://dx.doi.org/10.1016/j.knosys.2017.06.010 0950-7051/© 2017 Elsevier B.V. All rights reserved.

Besides traffic data, missing values also occur frequently in other real-life processes, such as physical measurements, commercial surveys, business reports, etc. In the data mining and machine learning community, missing values imputation has attracted