ISSN 0361-7688, Programming and Computer Software, 2018, Vol. 44, No. 6, pp. 388–397. © Pleiades Publishing, Ltd., 2018.
Original Russian Text © J. Vijayashree, H. Parveen Sultana, 2018, published in Programmirovanie, 2018, Vol. 44, No. 6.

A Machine Learning Framework for Feature Selection in Heart Disease Classification Using Improved Particle Swarm Optimization with Support Vector Machine Classifier¹

J. Vijayashree a,* and H. Parveen Sultana a
a School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, India
*e-mail: vijayashree.j@vit.ac.in
Received July 9, 2018

Abstract: Machine learning is used as an effective support system in health diagnosis, which involves large volumes of data. Analyzing such large volumes of data typically consumes considerable resources and execution time. In addition, not all the features present in a dataset contribute to solving the given problem. Hence, there is a need for an effective feature selection algorithm that finds the features contributing most to the diagnosis of diseases. Particle Swarm Optimization (PSO) is a metaheuristic algorithm for finding the best solution in less time. Nowadays, the PSO algorithm is used not only to select the more significant features but also to remove the irrelevant and redundant features present in the dataset. However, the traditional PSO algorithm has difficulty selecting the optimal weight used to update the velocity and position of the particles. To overcome this issue, this paper presents a novel function for identifying optimal weights on the basis of a population diversity function and a tuning function. We also propose a novel fitness function for PSO with the help of a Support Vector Machine (SVM). The objective of the fitness function is to minimize the number of attributes and increase the accuracy.
The performance of the proposed PSO-SVM is compared with various existing feature selection algorithms such as Info gain, Chi-squared, One attribute based, Consistency subset, Relief, CFS, Filtered subset, Filtered attribute, Gain ratio and the PSO algorithm. The SVM classifier is also compared with several classifiers such as Naive Bayes, Random forest and MLP.

Keywords: Particle Swarm Optimization, Support Vector Machine, fitness function, ROC analysis, population diversity function, tuning function
DOI: 10.1134/S0361768818060129

1. INTRODUCTION

The heart is one of the most important organs of the human body, so heart diseases are considered a significant health issue in day-to-day life. Many reports state that cardiovascular diseases are the root cause of sudden death among individuals in industrialized countries [1]. The increased death rate in industrialized countries affects individuals' health and finances as well as the countries' budgets [2]. The most important risk factors for cardiovascular disease include diabetes, high saturated fat intake, family history, obesity, smoking and high cholesterol. Nowadays, newborn babies are also affected by cardiovascular diseases. Hence, screening for cardiovascular diseases is very common in day-to-day life. Moreover, chest pain and fatigue are considered the most familiar symptoms of heart disease [3, 4].

To overcome this issue, a number of feature selection methods have been identified by computational researchers [7, 10, 14, 16]. In this paper, recent advancements in feature selection and frontiers in heart disease prediction are discussed in detail. Feature selection algorithms are commonly classified into two types, namely consistency filter-based feature selection and correlation-based feature selection.
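As a rough illustration of how a correlation-based method can score a candidate feature subset, the sketch below uses the merit heuristic commonly associated with CFS in the feature selection literature: a subset scores higher when its features correlate strongly with the class and weakly with one another. This is an assumption about a typical formulation, not necessarily the exact function used in the works cited here; the function name `cfs_merit` and the toy correlation values are hypothetical.

```python
import math


def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """Merit of a subset of k features (illustrative sketch, not the paper's code).

    mean_feature_class_corr   -- average absolute correlation between each
                                 feature in the subset and the class label
    mean_feature_feature_corr -- average absolute correlation between pairs
                                 of features in the subset
    Both averages are assumed to lie in [0, 1].
    """
    if k < 1:
        raise ValueError("subset must contain at least one feature")
    # Numerator rewards relevance to the class; denominator grows with
    # redundancy among the selected features.
    numerator = k * mean_feature_class_corr
    denominator = math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return numerator / denominator


# Toy comparison: same relevance, different redundancy.
independent = cfs_merit(4, 0.5, 0.0)  # features unrelated to each other
redundant = cfs_merit(4, 0.5, 1.0)    # features fully redundant
print(independent, redundant)
```

With four features whose mean feature-class correlation is 0.5, the merit falls from 1.0 to 0.5 as the mean inter-feature correlation rises from 0 to 1, which is how such a heuristic penalizes redundancy.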
Correlation-based feature selection is developed on the basis of filter-based feature selection, while consistency filter-based feature selection methods select the more important features based on the consistency value of each feature. In correlation-based feature selection, a simple heuristic evaluation function is used to rank the feature subsets [17, 18]. This heuristic evaluation function identifies the more significant features on the basis of their high correlation values. Thus, features with low correlation values are removed not only from the training dataset but also from the testing dataset [19, 20].

¹ The article is published in the original.

Moreover, the correlation-based feature selection