Inteligencia Artificial 23(65), 100-114
doi: 10.4114/intartif.vol23iss65pp100-114
ISSN: 1137-3601 (print), 1988-3064 (on-line)
©IBERAMIA and the authors
http://journal.iberamia.org/
Ensemble Feature Selection for Breast Cancer Classification using
Microarray Data
Supoj Hengpraprohm [1,A] and Suwimol Jungjit [2,B]
[1] Data Science Program, Faculty of Science and Technology,
Nakhon Pathom Rajabhat University, Nakhon Pathom, Thailand
[2] Department of Computer and Information Technology, Faculty of Science,
Thaksin University, Phatthalung, Thailand
[A] supojn@webmail.npru.ac.th, [B] suwimol@tsu.ac.th
Abstract This paper proposes an ensemble filter feature selection approach, EnSNR, for breast cancer data
classification. The microarray dataset used in the experiments contains 50,739 features (genes) for each of 32
patients. The main idea of EnSNR is to combine informative features obtained using two different feature
evaluation criteria, Entropy and Signal to Noise Ratio (SNR): a feature is included in the EnSNR subset only if it
is present in both sets of evaluation results. Entropy measures the amount of uncertainty in the outcome of a
random experiment, while SNR is an effective function for measuring the discriminative power of a feature.
Using these two functions gives EnSNR several advantages: the number of features in the EnSNR subset is not
user-defined (the subset is generated automatically); the operation of EnSNR is independent of the type of
classification algorithm employed; and only a small amount of processing time is required to generate the
subset. A Genetic Algorithm (GA) generates the breast cancer classification model from the EnSNR feature
subset, and the model is validated using 10-fold cross-validation. With the EnSNR feature subset, the approach
achieves high prediction accuracy (an average of 86.92 ± 5.47 in the experiments reported in this paper) while
significantly reducing the number of irrelevant features (genes) to be analyzed for cancer classification.
Keywords: Ensemble approach, Feature selection, Microarray data, Genetic Algorithm, Cancer Classification.
1 Introduction
Breast cancer is the most common cancer in women. The research described in this paper aims to improve on the
classification performance achieved so far [1, 2]. The paper demonstrates that the proposed ensemble feature
selection approach, EnSNR, is superior to the traditional Entropy and Signal to Noise Ratio (SNR) approaches
for selecting the informative features used in the prediction process. A block diagram of the feature selection
and data classification system used in the experiments is shown in Figure 1.
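To make the ensemble idea concrete, the following Python sketch keeps only those features that are selected by both an SNR criterion and an entropy criterion. It is an illustration under stated assumptions, not the implementation used in this paper: the SNR form |mu0 - mu1| / (sd0 + sd1), the median-split entropy score, the "better than the mean score" cut-off, and all function names (snr_scores, split_entropy_scores, ensemble_subset) are hypothetical choices made here for the example. Class labels are assumed to be coded as 0 and 1.

```python
import numpy as np

def snr_scores(X, y):
    """Per-feature signal-to-noise ratio, assumed form |mu0 - mu1| / (sd0 + sd1)."""
    a, b = X[y == 0], X[y == 1]
    eps = 1e-12  # guards against division by zero for constant features
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / (a.std(axis=0) + b.std(axis=0) + eps)

def binary_entropy(p):
    """Shannon entropy in bits of a Bernoulli(p) variable, elementwise."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def split_entropy_scores(X, y):
    """Class entropy left after splitting each feature at its median (lower is better).

    This particular entropy score is an assumption made for the sketch.
    """
    hi = X > np.median(X, axis=0)            # boolean, shape (n_samples, n_features)
    n_hi = hi.sum(axis=0)
    n_lo = X.shape[0] - n_hi
    pos = (y == 1)[:, None]
    p_hi = (hi & pos).sum(axis=0) / np.maximum(n_hi, 1)
    p_lo = (~hi & pos).sum(axis=0) / np.maximum(n_lo, 1)
    return (n_hi * binary_entropy(p_hi) + n_lo * binary_entropy(p_lo)) / X.shape[0]

def ensemble_subset(X, y):
    """Indices of features chosen by BOTH criteria (set intersection)."""
    snr = snr_scores(X, y)
    ent = split_entropy_scores(X, y)
    # Assumed cut-off: a feature passes a criterion if it beats that criterion's mean score.
    by_snr = set(np.flatnonzero(snr > snr.mean()))
    by_ent = set(np.flatnonzero(ent < ent.mean()))
    return sorted(int(i) for i in by_snr & by_ent)
```

Because the cut-offs are derived from the scores themselves, the size of the returned subset is not supplied by the user, and no classifier is involved in the selection. Given an expression matrix X (patients by genes) and a 0/1 label vector y, ensemble_subset(X, y) returns the sorted column indices of the retained genes.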
The block diagram shows:
• Breast Cancer Microarray Dataset. This is the source of patient data used in the experiments
• Feature Selection functions ‘Entropy’ and ‘Signal to Noise Ratio (SNR)’
• Feature Selection process ‘Ensemble (EnSNR)’