Data dimensionality reduction based on genetic selection of feature subsets

K.M. Faraoun 1, A. Rabhi 2

1 Evolutionary Engineering and Distributed Information Systems Laboratory, EEDIS, UDL University - SBA, 22000 - Algeria, Kamel_mh@yahoo.fr
2 Laboratoire des mathématiques, UDL University, SBA, 22000 - Algeria, rabhi_abbes@yahoo.fr

Abstract. In the present paper, we show that a multi-class classification process can be significantly enhanced by selecting an optimal subset of the features used as input for the training operation. Selecting such a subset reduces the dimensionality of the data samples and eliminates the redundancy and ambiguity introduced by some attributes. The classifier can then operate only on the selected features to perform the learning process. A genetic search is used here to explore the set of all possible feature subsets, whose size grows exponentially with the number of features. A new measure is proposed to compute the information gain provided by each feature subset, and is used as the fitness function of the genetic search. Experiments are performed using the KDD99 dataset to classify DoS network intrusions according to the 41 existing features. The optimality of the obtained feature subset is then tested using a multi-layered neural network. The obtained results show that the proposed approach can enhance both the classification rate and the learning runtime.

Keywords: Feature selection, genetic algorithms, pattern classification.

(Received October 10, 2006 / Accepted January 03, 2007)

1 Introduction

Pattern recognition relies on the extraction and selection of features that adequately characterize the objects of interest. Identifying the features that perform well in a classification algorithm is a difficult task, and the optimal choice can be non-intuitive: features that perform poorly on their own can often prevail when paired with other features [1].
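The abstract mentions an information-gain measure used as the GA fitness; its exact definition for feature subsets appears later in the paper. As background, the classical single-feature information gain it builds on can be sketched as the reduction in Shannon entropy of the class labels after conditioning on a discrete feature (a minimal illustration, not the paper's subset measure):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(class; feature) = H(class) - H(class | feature), discrete feature."""
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        subset = [y for f, y in zip(feature_values, labels) if f == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

# Toy example: the feature perfectly predicts the class.
f = [0, 0, 1, 1]
y = ['dos', 'dos', 'normal', 'normal']
print(information_gain(f, y))  # -> 1.0 (one full bit of information)
```

A feature that is independent of the class yields a gain of zero, which is why a filter built on this quantity discards such features.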
The filter approach [2] to feature selection tries to infer which features will work well for the classification algorithm by drawing conclusions from the observed distributions (histograms) of the individual features. However, histograms give little insight into the separation between the classes. The correlation structure of the data is responsible for the success of the joint classifier, and a good classification scheme will attempt to exploit this structure. Another technique, known as wrapper feature selection [3], uses the classification method itself to measure the importance of a feature or feature set; the goal of this approach is to maximize the predicted classification accuracy. While more computationally expensive, it tends to provide better results than the simpler filter methods.

Recent work in the field of pattern recognition explores the use of evolutionary algorithms for feature selection, and genetic algorithms (GAs) are one type of evolutionary algorithm that can be used effectively as an engine for solving the feature selection problem. Feature selection using genetic algorithms has been studied and proven effective in conjunction with various classifiers. Most existing works focus on the wrapper mode with different classification methods (neural networks, SVM, K-NN, etc.), and the same binary chromosome representation is generally used: a binary string represents the set of all existing features, with a value of 1 at the i-th position if the i-th feature is selected, and 0 otherwise. The advantage of this representation is that a standard and well-understood GA can be used without any modification. Unfortunately, this chromosome model is only appropriate for data that have small
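The binary chromosome encoding described above can be sketched with a minimal GA. This is an illustrative toy, not the paper's method: the fitness function here is a stand-in that rewards an assumed set of "informative" feature indices and penalizes subset size, whereas a wrapper would instead train a classifier on the selected columns and score its accuracy.

```python
import random

N_FEATURES = 41               # e.g. the 41 KDD99 features
INFORMATIVE = set(range(8))   # hypothetical ground truth for this toy run

def fitness(chrom):
    # Stand-in fitness: count informative features selected, minus a
    # small penalty per selected feature to favour compact subsets.
    hits = sum(chrom[i] for i in INFORMATIVE)
    return hits - 0.01 * sum(chrom)

def crossover(a, b):
    # Standard single-point crossover on the binary string.
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(chrom, rate=0.02):
    # Flip each bit independently with probability `rate`.
    return [g ^ 1 if random.random() < rate else g for g in chrom]

def genetic_search(pop_size=40, generations=60, seed=0):
    random.seed(seed)
    pop = [[random.randint(0, 1) for _ in range(N_FEATURES)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[:pop_size // 2]   # truncation selection: keep top half
        children = [mutate(crossover(random.choice(elite), random.choice(elite)))
                    for _ in range(pop_size - len(elite))]
        pop = elite + children
    return max(pop, key=fitness)

best = genetic_search()
selected = [i for i, g in enumerate(best) if g]
print(f"{len(selected)} features selected: {sorted(selected)}")
```

With this encoding, the search space has 2^41 subsets, which is why an exhaustive scan is infeasible and a genetic search is attractive.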