Learn++.MF: A random subspace approach for the missing feature problem

Robi Polikar a,*, Joseph DePasquale a, Hussein Syed Mohammed a, Gavin Brown b, Ludmilla I. Kuncheva c

a Electrical and Computer Engineering, Rowan University, 201 Mullica Hill Road, Glassboro, NJ 08028, USA
b University of Manchester, Manchester, England, UK
c University of Bangor, Bangor, Wales, UK

Article history: Received 9 November 2009; received in revised form 16 April 2010; accepted 21 May 2010.

Keywords: Missing data; missing features; ensemble of classifiers; random subspace method

Abstract

We introduce Learn++.MF, an ensemble-of-classifiers based algorithm that employs random subspace selection to address the missing feature problem in supervised classification. Unlike most established approaches, Learn++.MF does not replace missing values with estimated ones, and hence does not need specific assumptions on the underlying data distribution. Instead, it trains an ensemble of classifiers, each on a random subset of the available features. Instances with missing values are classified by the majority voting of those classifiers whose training data did not include the missing features. We show that Learn++.MF can accommodate a substantial amount of missing data, with only a gradual decline in performance as the amount of missing data increases. We also analyze the effects of the cardinality of the random feature subsets and of the ensemble size on algorithm performance. Finally, we discuss the conditions under which the proposed approach is most effective.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

1.1. The missing feature problem

The integrity and completeness of data are essential for any classification algorithm. After all, a trained classifier, unless specifically designed to address this issue, cannot process instances with missing features, as the missing value(s) in the input vectors would make the matrix operations involved in data processing impossible.
To obtain a valid classification, the data to be classified should be complete, with no missing features (henceforth, we use missing data and missing features interchangeably). Missing data are not uncommon in real-world applications: bad sensors, failed pixels, unanswered questions in surveys, malfunctioning equipment, medical tests that cannot be administered under certain conditions, etc. are all familiar scenarios in practice that can result in missing features. Feature values that fall beyond the expected dynamic range of the data due to extreme noise, signal saturation, data corruption, etc. can also be treated as missing. Furthermore, if the entire dataset is not acquired under identical conditions (time/location, equipment, etc.), different data instances may be missing different features. Fig. 1 illustrates such a scenario for a handwritten character recognition application: characters are digitized on an 8 × 8 grid, creating 64 features, f1–f64, a random subset (of about 20–30%) of which, indicated in orange (light shading), are missing in each case. Having such a large proportion of randomly varying features may be viewed as an extreme and unlikely scenario, warranting reacquisition of the entire dataset. However, data reacquisition is often expensive, impractical, or sometimes even impossible, justifying the need for a practical alternative. The classification algorithm described in this paper is designed to provide such a solution, accommodating missing features subject to the condition of distributed redundancy (discussed in Section 3), which is satisfied surprisingly often in practice.

1.2. Current techniques for accommodating missing data

The simplest approach to dealing with missing data is to ignore those instances with missing attributes.
Commonly referred to as filtering or listwise deletion, such techniques are clearly suboptimal when a large portion of the data have missing attributes [1], and are of course infeasible if every instance is missing one or more features. A more pragmatic approach commonly used to accommodate missing data is imputation [2–5]: substitute the missing value with a meaningful estimate. Traditional examples of this approach include replacing the missing value with one of the existing data points (the most similar by some measure), as in hot-deck imputation.

* Corresponding author. Tel.: +1 856 256 5372; fax: +1 856 256 5241. E-mail address: polikar@rowan.edu (R. Polikar).

Pattern Recognition (2010), doi:10.1016/j.patcog.2010.05.028
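To make the contrast with deletion and imputation concrete, the random subspace idea outlined in the abstract can be sketched as follows. This is a minimal illustration, not the exact Learn++.MF algorithm (which is developed in later sections); the class names, the nearest-centroid base classifier, and the parameter choices here are our own illustrative assumptions.

```python
import numpy as np
from collections import Counter


class NearestCentroid:
    """Tiny illustrative base classifier: predict the class of the nearest class mean."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Euclidean distance from each row of X to each class centroid
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[d.argmin(axis=1)]


class RandomSubspaceMissingFeatureEnsemble:
    """Sketch of the random-subspace strategy for missing features.

    Each base classifier is trained on a random subset of the features.
    At test time, an instance with missing values (marked as NaN) is voted
    on only by those classifiers whose feature subsets avoid the missing
    features; the majority vote is returned.
    """

    def __init__(self, base_factory, n_classifiers=50, subset_size=3, seed=0):
        self.base_factory = base_factory      # callable returning a fresh classifier
        self.n_classifiers = n_classifiers
        self.subset_size = subset_size
        self.rng = np.random.default_rng(seed)
        self.models = []                      # list of (feature_indices, fitted_model)

    def fit(self, X, y):
        n_features = X.shape[1]
        for _ in range(self.n_classifiers):
            feats = self.rng.choice(n_features, self.subset_size, replace=False)
            self.models.append((feats, self.base_factory().fit(X[:, feats], y)))
        return self

    def predict_one(self, x):
        missing = set(np.flatnonzero(np.isnan(x)))
        votes = [m.predict(x[feats].reshape(1, -1))[0]
                 for feats, m in self.models
                 if missing.isdisjoint(feats)]     # usable classifiers only
        if not votes:
            raise ValueError("no classifier avoids all missing features")
        return Counter(votes).most_common(1)[0][0]
```

Note that no missing value is ever estimated: classifiers trained on the affected features simply abstain, which is why the approach needs no assumptions about the data distribution, but does require that the discriminating information be redundantly spread across the features.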