Feature Selection using Misclassification Counts

Adil Bagirov 1, Andrew Yatsko 1, Andrew Stranieri 1, Herbert Jelinek 1,2

1 School of Science, Information Technology and Engineering, University of Ballarat, Ballarat, Victoria, 3353, Australia
E-mail: a.stranieri@ballarat.edu.au, a.bagirov@ballarat.edu.au, andrewyatsko@students.ballarat.edu.au

2 School of Community Health, Charles Sturt University, Albury, New South Wales, 2640, Australia
E-mail: hjelinek@csu.edu.au

Abstract

Dimensionality reduction of the problem space through detection and removal of variables that contribute little or nothing to classification can relieve the computational load, as well as the effort of instance acquisition, given that all data attributes are accessed each time. The approach to feature selection in this paper is based on the concept of coherent accumulation of data about class centers with respect to the coordinates of informative features. Features are ranked by the degree to which different variables exhibit random characteristics. The results are verified using the Nearest Neighbor classifier, which also helps to address feature irrelevance and redundancy, which ranking alone does not decide. Additionally, feature ranking methods from independent sources are brought in for direct comparison.

Keywords: classification, feature ranking, feature selection, dimensionality reduction, optimization.

1 Introduction

Supervised classification implies that a unique association of instances with classes of data is known at the training stage for a data sample. This mapping is then used to develop an algorithm by which any new instance can be assigned to the correct class based on the data. A classification algorithm has to be able to deal with computational complexity, commonly caused by the magnitude of instances and often compounded by the multitude of data attributes.
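As background for the Nearest Neighbor classifier used for verification throughout the paper, the following is a minimal k-NN sketch. The function name and toy data are illustrative, not the authors' implementation; it also shows why classification cost grows with both the number of instances and the number of attributes, since every attribute of every training instance is touched per query.

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Assign x to the majority class among its k nearest training instances.

    Uses squared Euclidean distance over all attributes, so the cost per
    query is proportional to (number of instances) x (number of attributes).
    """
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Two well-separated classes in 2-D (illustrative data)
train_X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (1.0, 0.9), (0.9, 1.1), (1.1, 1.0)]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_X, train_y, (0.1, 0.0)))  # → 0
print(knn_predict(train_X, train_y, (1.0, 1.0)))  # → 1
```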
This problem is acute in text categorization, where every word expands the attribute space into a whole new dimension. The area received much attention in the past and remains in focus even though the processing power of computers has increased dramatically. Some terminology has settled over time. (Saeys et al. 2007) give a contemporary view of feature selection methods in bioinformatics.

Without knowing better, we can safely assume that disengaging variables, when all are assumed to contribute, will reduce the classification accuracy. We can stage experiments to ascertain the influence of different variables, commonly referred to as features, indirectly, via the responses we get from a classifier. Various models of the feature-set are entered sequentially into the classifier, of whatever kind, and the best response is learned. This generic technique of feature selection is called wrapping. In this work we use the accuracy of the k-NN classifier as an indirect measure of the fitness of a feature-set.

Where a pre-selection of features is possible, it is termed filtering. Devices of different sorts are in employ, and if they can provide answers to feature irrelevance and redundancy - whether features align with no class, or their input is equivalent to that of others - so much the better. Filtering, which can be rather elaborate, is independent of the method of classification, although it inevitably uses the class information.

Copyright (c) 2011, Australian Computer Society, Inc. This paper appeared at the 9th Australasian Data Mining Conference (AusDM 2011), Ballarat, Australia. Conferences in Research and Practice in Information Technology (CRPIT), Vol. 121, Peter Vamplew, Andrew Stranieri, Kok-Leong Ong, Peter Christen and Paul Kennedy, Eds. Reproduction for academic, not-for-profit purposes permitted provided this text is included.
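The wrapping technique described above, with k-NN accuracy as the fitness measure, can be sketched as a greedy forward search. This is a minimal illustration under assumed names and synthetic data, not the paper's actual procedure; it scores candidate feature-sets by leave-one-out k-NN accuracy and keeps adding the best feature while the classifier's response improves.

```python
import random

def loo_knn_accuracy(X, y, feats, k=3):
    """Leave-one-out accuracy of a k-NN classifier restricted to `feats`."""
    correct = 0
    for i in range(len(X)):
        # Distances from instance i to all others, using selected features only
        dists = sorted(
            (sum((X[i][f] - X[j][f]) ** 2 for f in feats), y[j])
            for j in range(len(X)) if j != i
        )
        votes = [label for _, label in dists[:k]]
        if max(set(votes), key=votes.count) == y[i]:
            correct += 1
    return correct / len(X)

def wrapper_select(X, y):
    """Greedy forward selection: grow the feature-set while accuracy improves."""
    selected, remaining = [], list(range(len(X[0])))
    best = 0.0
    while remaining:
        acc, f = max((loo_knn_accuracy(X, y, selected + [f]), f) for f in remaining)
        if acc <= best and selected:
            break  # no remaining feature improves the wrapper's response
        selected.append(f)
        remaining.remove(f)
        best = acc
    return selected, best

# Synthetic data: feature 0 separates the classes, feature 1 is pure noise
random.seed(1)
X = [[cls + random.gauss(0, 0.2), random.uniform(0, 1)]
     for cls in (0, 1) for _ in range(15)]
y = [cls for cls in (0, 1) for _ in range(15)]
feats, acc = wrapper_select(X, y)
```

On this data the wrapper picks the informative feature first and stops once the noise feature fails to improve accuracy, which is exactly the indirect evidence of feature fitness that wrapping provides.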
Information Gain and Relief are two filtering techniques widely considered standard, each coming from a different perspective: probabilistic for the former, deterministic for the latter. Finally, there are methods of classification that select the features best suiting the class distribution as part of their tune-up. This is referred to as embedding. SVM is an example of a classifier where feature selection is embedded. We discuss these and other methods when comparing them with those introduced in this paper.

Only wrapping offers a universal approach to feature-set selection. A chosen set has to be consistent with the agenda of classification, that is, be sufficient for class discrimination. The enumeration of different subsets of features is computationally challenging. If monotonicity holds, so that adding a feature can only improve the fitness of the current set, exhaustive search can be avoided by a branch-and-bound arrangement that sets a qualifying level for fitness (Narendra and Fukunaga 1977). While the same features may add differently to the fitness of different sets, knowing the fitness of individual features can be useful.

Embedding may or may not produce a shortlist of features that best describe the data as a whole. In SVM it does. In Decision Trees, instead, one best feature is selected for spawning at different stages of tree growing (Quinlan 1993), with Information Gain often used as the criterion. The feature differs for the different subsets of data included in subsequent branches of the tree. Globally or locally, it helps to know how to rank features by relevance. While a ranking of features can be obtained as a by-product of feature subset selection by a wrapper or embedded method, ranking is often an element of the design of filter methods. Conversely, having features ranked is attractive for the quick assembly of a desired feature-set. For example, (Huda et al.
2010) pair a Neural Network wrapper with a filter, akin to Information Gain, to facilitate the selection of a feature-set of sufficient quality. Feature ranking is also the