Artificial Intelligence Review 20: 39–51, 2003. © 2003 Kluwer Academic Publishers. Printed in the Netherlands.

Using a Genetic Algorithm and a Perceptron for Feature Selection and Supervised Class Learning in DNA Microarray Data

MICHAL KARZYNSKI, ÁLVARO MATEOS, JAVIER HERRERO & JOAQUÍN DOPAZO*

Bioinformatics Unit, Centro Nacional de Investigaciones Oncológicas (CNIO), c/ Melchor Fernández Almagro 3, 28029, Madrid, Spain (*author for correspondence, e-mail: jdopazo@cnio.es)

Abstract. Class prediction and feature selection are key in the context of diagnostic applications of DNA microarrays. Microarray data is noisy and typically composed of a low number of samples and a large number of genes. Perceptrons can constitute an efficient tool for accurate classification of microarray data. Nevertheless, the large input layers necessary for the direct application of perceptrons and the small number of samples available for the training process hamper their use. Two strategies can be taken for an optimal use of a perceptron with a favourable balance between samples for training and the size of the input layer: (a) reducing the dimensionality of the data set from thousands of values to no more than one hundred highly informative average values, and using the weights of the perceptron for feature selection, or (b) using a selection of only a few genes that produce an optimal classification with the perceptron. In the latter case, feature selection is carried out first. Obviously, a combined approach is also possible. In this manuscript we explore and compare both alternatives. We study the informative contents of the data at different levels of compression with a very efficient clustering algorithm (Self Organizing Tree Algorithm). We show how a simple genetic algorithm selects a subset of gene expression values that achieves 100% accuracy in the classification of samples with maximum efficiency.
Finally, the importance of dimensionality reduction is discussed in light of its capacity for reducing noise and redundancies in microarray data.

Keywords: clustering, dimensionality reduction, feature selection, gene expression, genetic algorithm, perceptron, SOTA, weights

1. Introduction

The development and popularisation of DNA microarray technology has led to the possibility of measuring the expression level of thousands of genes in a single experiment (Brown and Botstein, 1999). Distinct experiments with different tissues, patients, etc., provide gene expression profiles under the different experimental conditions studied. DNA microarray data consists of huge matrices with thousands of rows, corresponding to the genes used in the study, and a number of columns, ranging from many tens to a few hundred, corresponding to the different experimental conditions at which