Artificial Intelligence Review 20: 39–51, 2003.
© 2003 Kluwer Academic Publishers. Printed in the Netherlands.
Using a Genetic Algorithm and a Perceptron for Feature Selection
and Supervised Class Learning in DNA Microarray Data
MICHAL KARZYNSKI, ÁLVARO MATEOS, JAVIER HERRERO &
JOAQUÍN DOPAZO*
Bioinformatics Unit, Centro Nacional de Investigaciones Oncológicas (CNIO), c/ Melchor
Fernández Almagro 3, 28029, Madrid, Spain (*author for correspondence, e-mail:
jdopazo@cnio.es)
Abstract. Class prediction and feature selection are key in the context of diagnostic
applications of DNA microarrays. Microarray data is noisy and typically composed of a small number
of samples and a large number of genes. Perceptrons can constitute an efficient tool for
accurate classification of microarray data. Nevertheless, the large input layers necessary for the
direct application of perceptrons and the small number of samples available for the training process
hamper their use. Two strategies can be taken for an optimal use of a perceptron with a favourable balance
between samples for training and the size of the input layer: (a) reducing the dimensionality of
the data set from thousands to no more than one hundred highly informative average values,
and using the weights of the perceptron for feature selection; or (b) using a selection of only
a few genes that produce an optimal classification with the perceptron. In the latter case, feature
selection is carried out first. Obviously, a combined approach is also possible. In this
manuscript we explore and compare both alternatives. We study the informative contents of the data
at different levels of compression with a very efficient clustering algorithm (the Self-Organizing
Tree Algorithm, SOTA). We show how a simple genetic algorithm selects a subset of gene expression
values that yields 100% accuracy in the classification of samples with maximum efficiency. Finally,
the importance of dimensionality reduction is discussed in light of its capacity for reducing
noise and redundancies in microarray data.
Keywords: clustering, dimensionality reduction, feature selection, gene expression, genetic
algorithm, perceptron, SOTA, weights
1. Introduction
The development and popularisation of DNA microarray technology has led
to the possibility of measuring the expression level of thousands of genes
in a single experiment (Brown and Botstein, 1999). Distinct experiments
with different tissues, patients, etc., provide gene expression profiles under
the different experimental conditions studied. DNA microarray data consists
of huge matrices with thousands of rows, corresponding to the genes used
in the study, and a number of columns, ranging from many tens to a few
hundred, corresponding to the different experimental conditions at which