A Reliable Weighted Feature Selection for Auto Medical Diagnosis

Golnaz Sahebi, Department of Future Technologies, University of Turku, Finland, golnaz.sahebi@utu.fi
Amin Majd, Department of Information Technology, Abo Akademi University, Finland, amin.majd@abo.fi
Masoumeh Ebrahimi, KTH Royal Institute of Technology, Sweden; University of Turku, Finland, masebr@kth.se
Juha Plosila, Department of Future Technologies, University of Turku, Finland, juplos@utu.fi
Hannu Tenhunen, KTH Royal Institute of Technology, Sweden; University of Turku, Finland, hannu@kth.se

Abstract: Feature selection is a key step in data analysis. However, most existing feature selection techniques are serial and too inefficient to be applied to massive data sets. We propose a feature selection method based on a multi-population weighted intelligent genetic algorithm to enhance the reliability of diagnoses in e-Health applications. The proposed approach, called PIGAS, utilizes a weighted intelligent genetic algorithm to select a proper subset of features that leads to a high classification accuracy. In addition, PIGAS takes advantage of a multi-population implementation to further enhance accuracy. To evaluate the selected feature subsets, the KNN classifier is used and assessed on the UCI Arrhythmia dataset. To guarantee valid results, the leave-one-out validation technique is employed. The experimental results show that the proposed approach outperforms other methods in terms of accuracy and efficiency. The results of the 16-class classification problem indicate an increase in the overall accuracy when using the optimal feature subset. The achieved accuracy of 99.70% indicates the potential of the algorithm to be utilized in a practical auto-diagnosis system. This accuracy was obtained using only half of the features, compared with an accuracy of 66.76% using all the features.

Keywords: Data Analysis; Feature Selection; K-Nearest Neighbor Classification; Optimization; Parallel Genetic Algorithm; E-Health.

I. INTRODUCTION

Heart and blood vessel diseases (cardiovascular diseases, CVDs) are the leading cause of death in the world. Potentially life-threatening conditions such as heart failure can be avoided if arrhythmias are detected at an early stage. One of the most valuable diagnostic tools for detecting CVDs is the electrocardiogram (ECG), which provides a representation of cardiac activity [1]. In recent years, one of the most significant innovations in the early detection of diseases has been wearable devices, which aim to provide real-time feedback about the health condition of a person. Despite their advantages, wearable systems face a number of challenges before they can become a reality. The most important hurdle is that their processors and architectures require a large amount of energy, demanding sizable batteries. This makes it difficult to reduce the size of wearable devices. Once miniaturization is achieved, another challenge arises: the reliability of decision making. The detection accuracy depends on the data analysis process. From this perspective, data analysis and machine learning algorithms play an important role [2].

The process of knowledge discovery in databases (KDD), or data analysis, involves several steps, such as dataset selection, data understanding, data preparation, data analysis, result interpretation, and result evaluation [3]. Feature selection is an important phase of data preparation and one of the significant issues in the construction of a classification model. Feature selection can be defined as the process of choosing a minimum subset of features from the original set of features (N) so that the feature space is optimally reduced while the classification accuracy remains relatively the same [4]. The two general categories of approaches to the feature selection problem are filter and wrapper methods. In the filter approach, features are selected by statistical properties.
By applying the filter approach, features can be selected quickly, but the performance of the resulting learning models is usually not as high as with the wrapper method, because the selected features may not be the best possible ones [5]. The wrapper technique, on the other hand, employs an optimization algorithm around the machine learning technique to find an optimal subset of features. This allows standard optimization methods to be used together with machine learning techniques. The wrapper approach is considered in this paper.

Different methods exist to solve the optimization problem, such as deterministic solutions, heuristic searches, and meta-heuristic searches [6]. For large-scale datasets, meta-heuristic approaches are more efficient, given the NP-complete nature of the feature selection problem [7]. Evolutionary algorithms (EAs) are a well-known class of meta-heuristic searches [6]. A dominant advantage of EAs for feature selection problems, compared with deterministic algorithms, is their capability to escape from local optima, which are often encountered in feature selection problems [8]. A popular group of EAs are genetic algorithms (GAs). They are population-based search techniques that mimic the process of natural selection and evolution. A GA starts by initializing a population and then repeatedly applies operations such as selection, crossover, mutation, and replacement. These operations are repeated until a satisfactory result is obtained or a given number of iterations is reached [9].
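
To make the wrapper idea and the GA loop described above more concrete, the following minimal Python sketch evolves binary feature masks and scores each mask by the leave-one-out accuracy of a k-NN classifier. It is only an illustration under stated assumptions, not the PIGAS method: it uses a single population, a plain steady-state GA, binary tournament selection, single-point crossover, and bit-flip mutation. All function names, parameter values, and the toy dataset are hypothetical and chosen for the example; seeding the population with the all-features mask is likewise an assumption made here so that the result is never worse than the all-features baseline.

```python
import random
from collections import Counter


def loo_knn_accuracy(X, y, mask, k=1):
    """Leave-one-out accuracy of k-NN using only features where mask[j] == 1."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # an empty feature subset cannot classify anything
    correct = 0
    for i in range(len(X)):
        # Squared Euclidean distance from sample i to every other sample.
        dists = sorted(
            (sum((X[i][j] - X[r][j]) ** 2 for j in cols), y[r])
            for r in range(len(X)) if r != i
        )
        votes = Counter(label for _, label in dists[:k])
        if votes.most_common(1)[0][0] == y[i]:
            correct += 1
    return correct / len(X)


def ga_select(X, y, pop_size=20, generations=30, p_mut=0.05, seed=0):
    """Toy steady-state GA over feature masks (illustrative parameters)."""
    rng = random.Random(seed)
    n = len(X[0])
    # Initialization: the all-features mask plus random masks (assumption).
    pop = [[1] * n] + [[rng.randint(0, 1) for _ in range(n)]
                       for _ in range(pop_size - 1)]
    scores = [loo_knn_accuracy(X, y, m) for m in pop]

    def tournament():
        a, b = rng.sample(range(pop_size), 2)
        return pop[a] if scores[a] >= scores[b] else pop[b]

    for _ in range(generations):
        p1, p2 = tournament(), tournament()            # selection
        cut = rng.randint(1, n - 1)
        child = p1[:cut] + p2[cut:]                    # single-point crossover
        child = [1 - b if rng.random() < p_mut else b  # bit-flip mutation
                 for b in child]
        child_score = loo_knn_accuracy(X, y, child)
        worst = min(range(pop_size), key=scores.__getitem__)
        if child_score >= scores[worst]:               # replacement
            pop[worst], scores[worst] = child, child_score

    best = max(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


def make_toy_data(n_samples=40, n_noise=6, seed=1):
    """Synthetic two-class data: 2 informative features followed by noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        informative = [3.0 * label + rng.gauss(0, 0.3),
                       -3.0 * label + rng.gauss(0, 0.3)]
        X.append(informative + [rng.gauss(0, 3.0) for _ in range(n_noise)])
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    mask, acc = ga_select(X, y)
    print("all features:", loo_knn_accuracy(X, y, [1] * len(X[0])))
    print("selected mask:", mask, "accuracy:", acc)
```

Because the fitness function wraps the classifier itself, every candidate subset costs a full leave-one-out evaluation. This is what makes wrapper methods slower than filters and motivates parallel or multi-population implementations.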