International Journal of Computer Applications (0975-8887)
Volume 122, No. 2, July 2015

Comparative Study among Data Reduction Techniques over Classification Accuracy

Ibrahim M. El-Hasnony
Faculty of Computer Science & Information Systems, Mansoura University, Mansoura, Egypt

Hazem M. El Bakry
Faculty of Computer Science & Information Systems, Mansoura University, Mansoura, Egypt

Ahmed A. Saleh
Faculty of Computer Science & Information Systems, Mansoura University, Mansoura, Egypt

ABSTRACT
Healthcare is one of the most critical domains requiring efficient and effective analysis. Data mining provides many techniques and tools that help in analyzing healthcare data well. Data classification is a form of data analysis for deriving models. Mining a reduced version of the data, or a smaller number of attributes, increases the efficiency of the system while providing almost the same results. In this paper, a comparative study of different data reduction techniques is introduced. The comparison is evaluated in terms of the accuracy of classification algorithms. The results showed that fuzzy rough feature selection outperforms rough set attribute selection, gain ratio, correlation feature selection, and principal components analysis.

General Terms
Data mining, bioinformatics

Keywords
Fuzzy rough feature selection, rough set attribute reduction, principal component analysis, correlation feature selection, gain ratio

1. INTRODUCTION
The growth in medical data volumes is a problem not only because of the enormous size of the data, but also because of its complexity and the increasing speed at which it is created [1]. There are many sources of medical data, such as mobile applications, capturing devices, and sensors, all of which result from the development of new technologies. Such large volumes of medical data are difficult to process or analyze using basic database management tools.
Clearly, capturing, storing, searching, and analyzing large medical data to discover useful knowledge will improve the outcomes of healthcare systems. In addition, intelligent decisions and effective analytical algorithms will lower healthcare costs.

Because medical databases can reach several gigabytes or more, they are prone to problems such as noise, missing values, and inconsistent data [3]. Data pre-processing improves the quality of the data, thereby helping to improve the accuracy and efficiency of the subsequent mining process. The knowledge discovery process depends mainly on data pre-processing. The efficiency of a decision depends mainly on the quality of the data; hence, data pre-processing is considered an important step in the knowledge discovery process. Detecting data anomalies, correcting them early, and reducing the data lead to substantial payoffs for decision making.

If the data used in the analysis is large, the data mining process will be slow. Data reduction obtains a reduced representation of the data set that is smaller in volume than the original, yet delivers almost the same analytical results. Dimension reduction, or attribute reduction [2], of large data sets has always been an active research area, particularly for data sets in the healthcare field. Not all attributes of these data sets are relevant for classification. From the perspective of classification, it is essential to retain only those attributes that maximize classification effectiveness. Data reduction covers not only reducing the number of attributes but also reducing the number of instances. The majority of data reduction work depends on attribute reduction.
Hence, when pre-processing is done to reduce attributes, the most important aspect is producing a reduct with the same effectiveness as the original data set. The reduct is the smallest set of attributes on which the original data depends.

The proposed model evaluates data reduction techniques together with classification algorithms, using accuracy as the metric. The model consists of a pre-processing phase and a classification phase. The pre-processing phase handles noisy data and carries out a comparative study among different feature reduction techniques: gain ratio, rough set attribute reduction (RSAR), fuzzy rough feature selection (FRFS), principal components analysis (PCA), and correlation feature selection (CFS). The model is tested in terms of the accuracy of the following classification algorithms: C4.5, fuzzy rough nearest neighbor, multi-layer perceptron (MLP), the nearest-neighbor-like algorithm using non-nested generalized exemplars (NNGE), fuzzy nearest neighbor, sequential minimal optimization (SMO), classification via clustering, NB-tree, and naïve Bayes.

The results showed that the fuzzy rough feature selection technique is more suitable for data reduction on medical data than the other algorithms. The comparison also showed that classification algorithms based on FRFS achieve higher accuracies than those based on other data reduction algorithms. Moreover, CFS achieved good results in many cases.

The rest of this paper is organized as follows. Section 2 highlights the most recent research in medical data pre-processing and classification. Section 3 presents the materials and methodologies on which the proposed model depends. Section 4 introduces the proposed model framework. Experimental results and conclusions are presented in Sections 5 and 6, respectively.
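
As an illustration of the reduct notion used in this introduction, the following minimal Python sketch searches a small decision table for the smallest attribute subsets that preserve the rough set dependency degree of the full attribute set. It is a sketch under stated assumptions, not the RSAR procedure evaluated in this paper: the exhaustive search is a simplification that is only practical for a handful of attributes, and the function names (partition, dependency, minimal_reducts) and the toy table are hypothetical.

```python
from itertools import combinations

def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    blocks = {}
    for i, row in enumerate(rows):
        key = tuple(row[a] for a in attrs)
        blocks.setdefault(key, []).append(i)
    return list(blocks.values())

def dependency(rows, labels, attrs):
    """Fraction of rows lying in equivalence classes with one decision label."""
    consistent = 0
    for block in partition(rows, attrs):
        if len({labels[i] for i in block}) == 1:
            consistent += len(block)
    return consistent / len(rows)

def minimal_reducts(rows, labels):
    """Smallest attribute subsets with the same dependency as all attributes.

    Exhaustive search over subsets (assumption: few attributes).
    """
    all_attrs = tuple(range(len(rows[0])))
    target = dependency(rows, labels, all_attrs)
    for size in range(1, len(all_attrs) + 1):
        found = [subset for subset in combinations(all_attrs, size)
                 if dependency(rows, labels, subset) == target]
        if found:
            return found
    return [all_attrs]

# Hypothetical toy table; columns: 0 = fever, 1 = cough, 2 = age group.
ROWS = [
    (1, 1, 0),
    (1, 0, 0),
    (0, 1, 1),
    (0, 0, 1),
    (1, 1, 1),
    (0, 0, 0),
]
LABELS = ["flu", "cold", "cold", "healthy", "flu", "healthy"]

print(minimal_reducts(ROWS, LABELS))  # [(0, 1)]
```

In this toy table, the attributes fever and cough together determine the decision as well as all three attributes do, so the age group column can be dropped without loss of classification information.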