ISSN 0361-7688, Programming and Computer Software, 2018, Vol. 44, No. 6, pp. 388–397. © Pleiades Publishing, Ltd., 2018.
Original Russian Text © J. Vijayashree, H. Parveen Sultana, 2018, published in Programmirovanie, 2018, Vol. 44, No. 6.
A Machine Learning Framework for Feature Selection
in Heart Disease Classification Using Improved Particle Swarm
Optimization with Support Vector Machine Classifier¹
J. Vijayashreeᵃ,* and H. Parveen Sultanaᵃ
ᵃ School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, India
*e-mail: vijayashree.j@vit.ac.in
Received July 9, 2018
Abstract: Machine learning is an effective decision-support tool in health diagnosis, which involves large
volumes of data. Analyzing such volumes of data consumes considerable resources and execution time.
In addition, not all the features present in a dataset contribute to solving the given problem. Hence,
an effective feature selection algorithm is needed to find the features that contribute most to
diagnosing diseases. Particle Swarm Optimization (PSO) is a metaheuristic algorithm that searches for
the best solution in less time. The PSO algorithm is used not only to select the more significant
features but also to remove the irrelevant and redundant features present in the dataset. However, the
traditional PSO algorithm has difficulty selecting the optimal weight for updating the velocity and
position of the particles. To overcome this issue, this paper presents a novel function for identifying
optimal weights on the basis of a population diversity function and a tuning function. We also propose
a novel fitness function for PSO with the help of a Support Vector Machine (SVM). The objective of
the fitness function is to minimize the number of attributes and maximize the accuracy. The performance
of the proposed PSO-SVM is compared with various existing feature selection algorithms such as Info gain,
Chi-squared, One attribute based, Consistency subset, Relief, CFS, Filtered subset, Filtered attribute,
Gain ratio, and the PSO algorithm. The SVM classifier is also compared with several classifiers such as
Naive Bayes, Random forest, and MLP.
Keywords: Particle Swarm Optimization, Support Vector Machine, fitness function, ROC analysis,
population diversity function, tuning function
DOI: 10.1134/S0361768818060129
1. INTRODUCTION
The heart is one of the most important organs of
the human body, so heart disease is a significant
health issue in day-to-day life. Many reports state
that cardiovascular diseases are a leading cause of
sudden death in industrialized countries [1]. The
increased mortality in these countries affects both
individuals' health and finances and national
budgets [2]. The most important risk factors for
cardiovascular disease include diabetes, high
saturated fat intake, family history, obesity,
smoking, and high cholesterol. Nowadays, even
newborn babies are affected by cardiovascular
diseases. Hence, screening for cardiovascular
diseases has become very common.
Moreover, chest pain and fatigue are considered
the most common symptoms of heart disease [3, 4].
To address this issue, researchers have proposed
a number of feature selection methods [7, 10, 14, 16].
In this paper, recent advances in feature selection
and the frontiers of heart disease prediction are
discussed in detail.
Feature selection algorithms are commonly
classified into two types: consistency filter-based
feature selection and correlation-based feature
selection. Correlation-based feature selection is
built on filter-based feature selection, while
consistency filter-based methods select the more
important features based on the consistency value
of each feature. In correlation-based feature
selection, a simple heuristic evaluation function
is used to rank the feature subsets [17, 18].
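The text does not state the heuristic evaluation function itself. As an illustration only, the sketch below assumes the widely cited correlation-based merit, k·r_cf / sqrt(k + k(k−1)·r_ff), where k is the subset size, r_cf the mean feature-class correlation, and r_ff the mean feature-feature correlation. The use of Pearson correlation, the helper names `pearson` and `cfs_merit`, and the toy data are assumptions for this example and are not taken from the paper.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def cfs_merit(features, labels):
    """Assumed subset merit: k * r_cf / sqrt(k + k * (k - 1) * r_ff).

    features: list of columns (one list of values per feature)
    labels:   list of class values
    """
    k = len(features)
    if k == 0:
        return 0.0
    # Mean absolute feature-class correlation.
    r_cf = sum(abs(pearson(f, labels)) for f in features) / k
    # Mean absolute feature-feature correlation (0 for a single feature).
    pairs = list(combinations(features, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


# Toy data, made up for illustration (not from the paper).
labels = [0, 0, 1, 1, 1, 0]
informative = [1.0, 1.2, 3.1, 2.9, 3.3, 0.8]
duplicate = [2.0, 2.4, 6.2, 5.8, 6.6, 1.6]  # exactly 2 * informative
noise = [5.0, 1.0, 4.0, 2.0, 3.0, 3.0]      # uncorrelated with labels

print("informative only:       ", cfs_merit([informative], labels))
print("informative + duplicate:", cfs_merit([informative, duplicate], labels))
print("informative + noise:    ", cfs_merit([informative, noise], labels))
```

Under this assumed merit, adding a fully redundant copy of a feature leaves the score unchanged, and adding a feature uncorrelated with the class lowers it, so ranking subsets by merit favors features that correlate with the class but not with each other.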
Hence, this heuristic evaluation function
identifies the more significant features on the
basis of their high correlation values. Features
with low correlation values are thus removed not
only from the training dataset but also from the
testing dataset [19, 20].
Moreover, the correlation-based feature selection
¹ The article is published in the original.