Knowledge-Based Systems 000 (2017) 1–14 (article in press)

Ensemble correlation-based low-rank matrix completion with applications to traffic data imputation

Xiaobo Chen a,*, Zhongjie Wei b, Zuoyong Li c, Jun Liang a, Yingfeng Cai a, Bob Zhang d

a Automotive Engineering Research Institute, Jiangsu University, Zhenjiang 212013, China
b School of Automotive and Traffic Engineering, Jiangsu University, Zhenjiang 212013, China
c Fujian Provincial Key Laboratory of Information Processing and Intelligent Control, Minjiang University, Fuzhou 350108, China
d Department of Computer and Information Science, University of Macau, Macau, China

Article info

Article history: Received 24 August 2016; Revised 4 June 2017; Accepted 6 June 2017; Available online xxx

Keywords: Missing data; Low-rank matrix completion; Nearest neighbor; Pearson’s correlation; Ensemble learning

Abstract

Low-rank matrix completion (LRMC) is a recently emerging technique that has achieved promising performance in many real-world applications, such as traffic data imputation. To estimate missing values, current LRMC-based methods optimize the rank of the matrix comprising the whole traffic data, implicitly assuming that all traffic data are equally important. As a result, they put more emphasis on the commonality of traffic data while ignoring its subtle but crucial differences due to the different locations of loop detectors and the different dates of sampling. To handle this problem and further improve imputation performance, a novel correlation-based LRMC method is proposed in this paper. First, LRMC is applied to obtain initial estimates of the missing values.
Then, a distance matrix containing the pairwise distances between samples is built based on a weighted Pearson’s correlation, which strikes a balance between observed values and imputed values. For a specific sample, its most similar samples according to this distance matrix are chosen using an adaptive K-nearest neighbor (KNN) search. LRMC is then applied to these much more strongly correlated samples to obtain refined estimates of the missing values. Finally, we propose a simple but effective ensemble learning strategy that integrates multiple imputed values for a specific sample to further improve imputation performance. Extensive numerical experiments are performed on both traffic flow volume data and standard benchmark datasets. The results confirm that the proposed correlation-based LRMC and its ensemble learning version achieve better imputation performance than competing methods.

© 2017 Elsevier B.V. All rights reserved.

Corresponding author. E-mail address: xbchen82@gmail.com (X. Chen).

1. Introduction

The Intelligent Transportation System (ITS) is an effective way to alleviate traffic congestion and improve transportation efficiency. It is an integrated, comprehensive system that synthesizes a variety of technologies, including information technology, computing, data communication, sensing, electronic control, automatic control theory, operations research, and artificial intelligence. Data is one of the most important factors in an ITS; the most commonly used parameters include average speed, real-time vehicle volume, and average lane occupancy rate. By collecting and analyzing massive amounts of traffic data, an ITS can manage and predict traffic more effectively. For example, it can (1) find traffic anomalies quickly and make traffic management more convenient, and (2) discover inherent rules and knowledge from traffic data, so as to improve the operational efficiency of traffic management and road traffic capacity.
Based on the traffic flow predicted a few hours ahead, users can adjust their route plans in advance to avoid congested roads. Thus, traffic data will play a fundamental role in the construction of ITS.

In an actual traffic environment, the data collected by traffic equipment, e.g., loop detectors, are usually incomplete: many missing values may occur for a variety of reasons, such as failures of the loop detectors or of the transmission network. Incomplete traffic data cannot express traffic information accurately. More importantly, it prevents the application of many classic data mining algorithms, such as support vector machines [1–3], neural networks [4], and sparse learning [5], because these algorithms generally require a complete set of data. Therefore, the imputation of missing values in a loop detection system is of great value.

http://dx.doi.org/10.1016/j.knosys.2017.06.010
0950-7051/© 2017 Elsevier B.V. All rights reserved.

Besides traffic data, missing values also occur frequently in other real-life processes, such as physical measurements, commercial surveys, and business reports. In the data mining and machine learning community, missing values imputation has attracted