Critical Dimension in Data Mining

Divya Suryakumar, Andrew H. Sung
Department of Computer Science and Engineering
New Mexico Institute of Mining and Technology
Socorro, New Mexico 87801, USA
divya|sung @cs.nmt.edu

Qingzhong Liu
Department of Computer Science
Sam Houston State University
Huntsville, Texas 77341, USA
liu@shsu.edu

Abstract - Data mining is an increasingly important means of knowledge acquisition for many applications in diverse fields such as biology, medicine, management, and engineering. When tackling a large-scale problem that involves a multitude of potentially relevant factors but lacks a precise formulation or mathematical characterization that would allow a formal approach to solution, the data collected for the application can often be mined to extract knowledge about the problem. Feature ranking and selection are therefore immediate issues to consider when preparing to perform data mining, and the literature contains numerous theoretical and empirical methods of feature selection for a variety of problems. This work-in-progress paper concerns the related question of critical dimension: for a specific data mining task, does there exist a minimum number of features required for a specific learning machine to achieve satisfactory performance? As a first step in addressing this question, a simple ad-hoc method is employed for experiments, and it is shown that the phenomenon of critical dimension indeed exists for several of the datasets studied. The implication is that each of these datasets contains irrelevant features or input attributes, which can be eliminated to achieve higher accuracy in model building using learning machines.

Keywords - feature selection; critical dimension; machine learning.

I.
INTRODUCTION

Data mining aims at extracting useful information or knowledge from datasets. To achieve this goal, feature selection is often necessary to eliminate lesser or insignificant features, both to reduce the size of the dataset and to facilitate model building (e.g., using learning machines) for knowledge extraction. Many methods have been proposed for feature selection [1]. Interestingly, not all extracted features are individually useful; moreover, the correlation among features is itself an intriguing question. We may use learning machines to find feature correlations or to discover important or relevant features, though some theoretically optimal criteria can become practically intractable [2]. The ultimate, guaranteed-optimal feature selection method requires exhaustive analysis of all possible subsets of features; this is infeasible for datasets with a large number of features, so the next best goal is to find a satisfactory set of subsets.

Feature selection is usually done in two different ways: subset (or entropy-based) selection, and feature ranking. Feature ranking uses ranking algorithms that score all features using a certain metric and rank them accordingly [3]. A subset selection method uses an algorithm to find a best possible subset in arbitrary time; here, the term "best possible subset" refers to the best subset found among a satisfactory set of subsets [4].

II. FEATURE RANKING

The main objectives of feature selection are to improve prediction performance or accuracy, to provide faster and more cost-effective predictors, and to understand the correlations in the data [5]. For our experiments, we use both feature ranking and subset selection. A supervised 'Chi-squared Ranking Filter' [6] and a supervised 'Support Vector Machine (SVM) feature evaluator' [7] are used for ranking features. A 'Ranker' search method ranks attributes according to their relevance and individual evaluations.
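As an illustrative sketch only (not part of our experimental setup), a filter-style ranker scores each feature independently and sorts the features by score. The helper names below and the scoring metric (absolute Pearson correlation with the class) are stand-ins; the rankers used in this work substitute their own statistics.

```python
# Sketch of filter-style feature ranking on a toy dataset.
# The scorer here (absolute Pearson correlation with the class) is a
# hypothetical stand-in for metrics such as the chi-squared statistic.
import math

def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 for constant columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    vy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy) if vx and vy else 0.0

def rank_features(columns, labels):
    """Score each feature independently, then sort highest first."""
    scores = {name: abs(pearson(col, labels)) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy data: f1 tracks the class exactly, f2 only weakly.
labels  = [0, 1, 0, 1, 1, 0]
columns = {"f1": [0, 1, 0, 1, 1, 0], "f2": [5, 4, 5, 4, 5, 4]}
print(rank_features(columns, labels))  # prints ['f1', 'f2']
```

Because each feature is scored in isolation, a ranker of this kind is fast but blind to feature interactions, which is exactly the gap that subset selection tries to fill.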
Using Ranker we can set a threshold to reduce the attribute set under consideration, or specify a set of attributes to ignore; this makes it convenient in our experiments to eliminate unwanted features.

The Chi-squared Ranking Filter evaluates the worth of an attribute by computing the value of the chi-squared statistic with respect to the class. The chi-squared test is a statistical test of the independence of two events, or of the goodness of fit of an observed distribution to a theoretical one; its value ranges from zero to infinity and cannot be negative.

The SVM feature evaluator assesses the worth of an attribute by means of an SVM classifier; attributes are ranked by the square of the weight the SVM assigns to them. Attribute selection for multiclass problems is handled by ranking attributes for each class separately using a one-versus-all method and then dealing from the top of each pile to give a final ranking.

To find the best feature subset, we use the supervised CFS Subset Evaluator with a greedy stepwise search algorithm. The evaluator assesses the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy among them; subsets of features that are highly correlated with the class while having low inter-correlation are preferred [8][9].

The two feature selection methods discussed above are the most widely used, but there could always be one subset that is the best feature subset, or the correlation among certain low-ranked features could increase the

eKNOW 2012 : The Fourth International Conference on Information, Process, and Knowledge Management
Copyright (c) IARIA, 2012. ISBN: 978-1-61208-181-6
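For concreteness, the statistic underlying the chi-squared ranking filter can be sketched for discrete attributes: build a contingency table of feature value versus class and compare observed against expected cell counts. The function name and toy data below are illustrative; a production filter would also discretize numeric attributes first.

```python
# Sketch of the chi-squared statistic used to rank a discrete feature
# against the class: larger values indicate stronger dependence.
from collections import Counter

def chi_squared(values, labels):
    n = len(values)
    obs = Counter(zip(values, labels))   # observed cell counts
    v_tot = Counter(values)              # row totals (feature values)
    c_tot = Counter(labels)              # column totals (classes)
    stat = 0.0
    for v in v_tot:
        for c in c_tot:
            expected = v_tot[v] * c_tot[c] / n
            stat += (obs[(v, c)] - expected) ** 2 / expected
    return stat

# Toy data: "colour" determines the class, "size" is independent of it.
labels = ["pos", "pos", "neg", "neg"]
colour = ["red", "red", "blue", "blue"]
size   = ["big", "small", "big", "small"]
print(chi_squared(colour, labels))  # prints 4.0 (perfectly dependent)
print(chi_squared(size, labels))    # prints 0.0 (independent)
```

The statistic is zero exactly when the observed counts match the independence assumption, and it grows without bound as the feature becomes more informative about the class, matching the zero-to-infinity range noted above.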
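The greedy stepwise search over a CFS-style merit can likewise be sketched in a few lines. The merit formula below is the standard CFS merit, k·r_cf / sqrt(k + k(k-1)·r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation; plain Pearson correlation stands in for the correlation measure, and all helper names are ours.

```python
# Sketch of greedy stepwise subset selection with a CFS-style merit:
# prefer subsets highly correlated with the class but weakly
# inter-correlated. Pearson correlation is an illustrative stand-in.
import math

def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 for constant columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    vy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy) if vx and vy else 0.0

def merit(subset, columns, labels):
    """CFS merit: k * r_cf / sqrt(k + k*(k-1)*r_ff)."""
    k = len(subset)
    r_cf = sum(abs(pearson(columns[f], labels)) for f in subset) / k
    if k == 1:
        return r_cf
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

def greedy_stepwise(columns, labels):
    """Repeatedly add the feature that most improves the merit; stop when none does."""
    selected, best = [], 0.0
    remaining = sorted(columns)  # sorted for a deterministic tie-break
    while remaining:
        cand = max(remaining, key=lambda f: merit(selected + [f], columns, labels))
        m = merit(selected + [cand], columns, labels)
        if m <= best:
            break
        selected.append(cand)
        best = m
        remaining.remove(cand)
    return selected

# Toy data: f1 predicts the class, f2 duplicates f1 (redundant), f3 is weak.
labels  = [0, 1, 0, 1, 1, 0]
columns = {"f1": [0, 1, 0, 1, 1, 0],
           "f2": [0, 1, 0, 1, 1, 0],
           "f3": [5, 4, 5, 4, 5, 4]}
print(greedy_stepwise(columns, labels))  # prints ['f1']
```

Note how the redundant duplicate f2 is rejected: adding it raises r_ff as much as r_cf, so the merit does not improve. This is the redundancy penalty that distinguishes subset evaluation from per-feature ranking.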