Feature Selection using Misclassification Counts

Adil Bagirov 1, Andrew Yatsko 1, Andrew Stranieri 1, Herbert Jelinek 1,2

1 School of Science, Information Technology and Engineering, University of Ballarat, Ballarat, Victoria, 3353, Australia
E-mail: a.stranieri@ballarat.edu.au, a.bagirov@ballarat.edu.au, andrewyatsko@students.ballarat.edu.au

2 School of Community Health, Charles Sturt University, Albury, New South Wales, 2640, Australia
E-mail: hjelinek@csu.edu.au

Abstract

Dimensionality reduction of the problem space through detection and removal of variables that contribute little or nothing to classification can relieve the computational load, as well as the effort of instance acquisition, given that all data attributes are accessed each time. The approach to feature selection in this paper is based on the concept of coherent accumulation of data about class centers with respect to the coordinates of informative features. Features are ranked by the degree to which different variables exhibit random characteristics. The results are verified using the Nearest Neighbor classifier, which also helps to address feature irrelevance and redundancy, which ranking alone does not decide. Additionally, feature ranking methods from independent sources are brought in for direct comparison.

Keywords: classification, feature ranking, feature selection, dimensionality reduction, optimization.

1 Introduction

Supervised classification implies that a unique association of instances with classes of data is known at the training stage for a data sample. This mapping is then used to develop an algorithm by which any new instance can be assigned to the correct class based on the data. A classification algorithm has to be able to deal with computational complexity, commonly caused by the magnitude of instances and often compounded by the multitude of data attributes.
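As background for the Nearest Neighbor classifier used for verification throughout the paper, the following is a minimal k-NN sketch. The function name and toy data are illustrative, not the authors' implementation; it also shows why classification cost grows with both the number of instances and the number of attributes, since every attribute of every training instance is touched per query.

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Assign x to the majority class among its k nearest training instances.

    Uses squared Euclidean distance over all attributes, so the cost per
    query is proportional to (number of instances) x (number of attributes).
    """
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Two well-separated classes in 2-D (illustrative data)
train_X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (1.0, 0.9), (0.9, 1.1), (1.1, 1.0)]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_X, train_y, (0.1, 0.0)))  # → 0
print(knn_predict(train_X, train_y, (1.0, 1.0)))  # → 1
```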
This problem is acute in text categorization, where every word expands the attribute space into a whole new dimension. The area received much attention in the past and remains in focus even though the processing power of computers has increased dramatically. Some terminology has settled over time. (Saeys et al. 2007) give a contemporary view of feature selection methods in bioinformatics.

Without knowing better, we can safely assume that disengaging variables, when all are assumed to contribute, will reduce the classification accuracy. We can stage experiments to ascertain the influence of different variables, commonly referred to as features, indirectly, via the responses we get from a classifier. Various models of the feature-set are entered sequentially into the classifier, of whatever kind, and the best response is learned. This generic technique of feature selection is called wrapping. In this work we use the accuracy of the k-NN classifier as an indirect measure of the fitness of a feature-set.

Where a pre-selection of features is possible, it is termed filtering. Devices of different sorts are in employ, and if they can provide answers to feature irrelevance and redundancy - whether features align with no class, or their input is equivalent to that of others - so much the better. Filtering, which can be rather elaborate, is independent of the method of classification, although it inevitably uses the class information.

Copyright (c) 2011, Australian Computer Society, Inc. This paper appeared at the 9th Australasian Data Mining Conference (AusDM 2011), Ballarat, Australia. Conferences in Research and Practice in Information Technology (CRPIT), Vol. 121, Peter Vamplew, Andrew Stranieri, Kok-Leong Ong, Peter Christen and Paul Kennedy, Eds. Reproduction for academic, not-for-profit purposes permitted provided this text is included.
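The wrapping technique described above, with k-NN accuracy as the fitness measure, can be sketched as a greedy forward search. This is a minimal illustration under assumed names and synthetic data, not the paper's actual procedure; it scores candidate feature-sets by leave-one-out k-NN accuracy and keeps adding the best feature while the classifier's response improves.

```python
import random

def loo_knn_accuracy(X, y, feats, k=3):
    """Leave-one-out accuracy of a k-NN classifier restricted to `feats`."""
    correct = 0
    for i in range(len(X)):
        # Distances from instance i to all others, using selected features only
        dists = sorted(
            (sum((X[i][f] - X[j][f]) ** 2 for f in feats), y[j])
            for j in range(len(X)) if j != i
        )
        votes = [label for _, label in dists[:k]]
        if max(set(votes), key=votes.count) == y[i]:
            correct += 1
    return correct / len(X)

def wrapper_select(X, y):
    """Greedy forward selection: grow the feature-set while accuracy improves."""
    selected, remaining = [], list(range(len(X[0])))
    best = 0.0
    while remaining:
        acc, f = max((loo_knn_accuracy(X, y, selected + [f]), f) for f in remaining)
        if acc <= best and selected:
            break  # no remaining feature improves the wrapper's response
        selected.append(f)
        remaining.remove(f)
        best = acc
    return selected, best

# Synthetic data: feature 0 separates the classes, feature 1 is pure noise
random.seed(1)
X = [[cls + random.gauss(0, 0.2), random.uniform(0, 1)]
     for cls in (0, 1) for _ in range(15)]
y = [cls for cls in (0, 1) for _ in range(15)]
feats, acc = wrapper_select(X, y)
```

On this data the wrapper picks the informative feature first and stops once the noise feature fails to improve accuracy, which is exactly the indirect evidence of feature fitness that wrapping provides.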
Information Gain and Relief are two filtering techniques widely considered standard, each coming from a different perspective: probabilistic for the former, deterministic for the latter. Finally, there are methods of classification that select the features best suiting the class distribution as part of their tune-up. This is referred to as embedding. SVM is an example of a classifier where feature selection is embedded. We discuss these and other methods when comparing them with those introduced in this paper.

Only wrapping offers a universal approach to feature-set selection. A chosen set has to be consistent with the agenda of classification, that is, be sufficient for class discrimination. The enumeration of different subsets of features is computationally challenging. If monotonicity holds, so that adding a feature can only improve the fitness of the current set, exhaustive search can be avoided by a branch-and-bound arrangement that sets a qualifying level for fitness (Narendra and Fukunaga 1977). While the same features may add differently to the fitness of different sets, knowing the fitness of individual features can be useful.

Embedding may or may not produce a shortlist of features that best describe the data as a whole. In SVM it does. In Decision Trees, instead, one best feature is selected for spawning at different stages of tree growing (Quinlan 1993), with Information Gain often used as the criterion. The feature differs for the different subsets of data included in subsequent branches of the tree. Globally or locally, it helps to know how to rank features by relevance. While a ranking of features can be obtained as a by-product of feature subset selection by a wrapper or embedded method, ranking is often an element of the design of filter methods. Conversely, having features ranked is attractive for the quick assembly of a desired feature-set. For example, (Huda et al.
2010) pair a Neural Network wrapper with a filter, akin to Information Gain, to facilitate the selection of a feature-set of sufficient quality. Feature ranking is also the