IJCSNS International Journal of Computer Science and Network Security, VOL.6 No.3A, March 2006

Feature Extraction using Fuzzy C-Means Clustering for Data Mining Systems

Srinivasa K G*, Venugopal K R¹ and L M Patnaik²

* Data Mining Laboratory, MS Ramaiah Institute of Technology, Bangalore, India.
¹ Department of CSE, University Visvesvaraya College of Engineering, Bangalore University, Bangalore, India.
² Microprocessor Applications Laboratory, Indian Institute of Science, Bangalore.

Abstract

The Knowledge Discovery and Data Mining (KDD) process includes preprocessing, transformation, data mining and knowledge extraction. The two important tasks of data mining are clustering and classification. In this paper, we propose a generic feature extraction scheme for classification using Fuzzy C-Means (FCM) clustering. The raw data is preprocessed and normalized, and the data points are then clustered using the fuzzy c-means technique. Feature vectors for all the classes are generated by extracting the most relevant features from the corresponding clusters and are used for further classification. Artificial Neural Networks and Support Vector Machines are used to perform the classification task. Experiments are conducted on four datasets, and the accuracy obtained by performing specific feature extraction for a particular data set is compared with that of the generic feature extraction scheme. In terms of classification results, the algorithm performs relatively well when compared with the specific feature extraction technique.

1. INTRODUCTION

Data mining is the extraction of hidden, predictive information from large databases. The overall Knowledge Discovery and Data Mining (KDD) process deals with turning low-level data into high-level knowledge. The process of data mining begins with an understanding of the application domain. This includes relevant prior knowledge as well as the goals of the system.
First, data cleaning and pre-processing are carried out on the raw data to remove noise and handle missing data. Next, data reduction and projection are performed to find the minimal set of features that represents the data. An appropriate data mining model is then used to extract the patterns for classification. Finally, the knowledge obtained is incorporated into the performance system.

The four important steps in data mining are pre-processing, clustering, feature extraction and classification. Pre-processing transforms raw data into a more useful form. Its two important steps are noise removal and the handling of missing data. Pre-processing is specific to the problem in question; however, certain techniques are widely accepted, such as transforms (Fourier, Wavelet, etc.) and data normalization. Clustering is a form of unsupervised learning, i.e., the available data is not labelled and the output is a set of clusters, each containing similar points. Commonly used clustering techniques are k-means and k-medoids. Feature extraction handles the problem of high dimensionality, where applying a classifier directly to the data becomes infeasible. Techniques used for feature extraction include principal component analysis, independent component analysis and, in the case of images, edge detection. Classification maps the data into predefined groups or classes. The main function of the classification system is learning. Some of the tools used in classification are Artificial Neural Networks and Support Vector Machines.

2. RELATED WORK

A survey on soft computing approaches to data mining is presented in [1]. The compression of waves using wavelets and their performance evaluation is discussed in [2]. A Neuro-Fuzzy system with Invariant Wavelets is used to classify EEG spikes in [3]; however, it cannot be extended to a more general system.
In [4], a new feature extraction process for time series data using the DWT (Discrete Wavelet Transform) and the DFT (Discrete Fourier Transform) is employed, but it can be used only for a specific purpose. Michail Vlachos et al. [5] present a novel anytime k-means clustering algorithm to evaluate feature extraction. Kohonen's SOM (Self-Organizing Map) is used to provide additional dimensionality reduction for clustering in [6]. In [7], wavelet transforms are used to handle high dimensional data, but the system cannot be generalized. In [8], a manual application of pre-processing techniques depending on sample characteristics using Fuzzy C-Means clustering is discussed, but this defeats the aim of automating the process of data mining. Nello Cristianini et al. [9] describe the performance of a new SVM (Support Vector Machine) for classification. The potential