Combining Hierarchical Clustering and Self-Organizing Maps for Exploratory Analysis of Gene Expression Patterns

Javier Herrero and Joaquín Dopazo*

Bioinformatics Unit, Spanish National Cancer Center (CNIO), Melchor Fernández Almagro 3, 28029 Madrid, Spain

Received March 26, 2002

Abstract: Self-organizing maps (SOM) constitute an alternative to classical clustering methods because of their linear run times and superior performance in dealing with noisy data. Nevertheless, the clustering obtained with SOM is dependent on the relative sizes of the clusters. Here, we show how the combination of SOM with hierarchical clustering methods constitutes an excellent tool for exploratory analysis of massive data like DNA microarray expression patterns.

Keywords: DNA array, gene expression patterns, hierarchical clustering, SOM, SOTA

DNA microarray technology opens up the possibility of measuring the expression level of thousands of genes in a single experiment (1). Serial experiments measuring gene expression under different conditions or at different times, or distinct experiments with diverse tissues, patients, etc., allow gene expression profiles to be obtained across the experimental conditions studied. Initial experiments suggest that genes having similar expression profiles tend to play similar, or at least related, roles in the cell. Since thousands of genes are involved in such experiments, an efficient technique for detecting the clusters of co-expressing genes is key to the analysis of microarray expression data. Aggregative hierarchical clustering has been extensively used for this purpose (2,3), although several authors (4) have claimed that it suffers from a lack of robustness. Another problem shared by clustering methods based on the calculation of a distance matrix is their slow run times, which are, in the best case, quadratic in the number of items (5). This can constitute a real difficulty when thousands of items are to be analyzed.
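The quadratic cost of distance-matrix methods can be made concrete with a minimal sketch. It assumes Euclidean distance between profiles. The toy profiles and the helper names `pairwise_distances` and `pair_count` are illustrative only and do not come from the original work.

```python
from itertools import combinations
from math import dist


def pairwise_distances(profiles):
    """Return {(i, j): distance} for every unordered pair of profiles."""
    return {
        (i, j): dist(profiles[i], profiles[j])
        for i, j in combinations(range(len(profiles)), 2)
    }


def pair_count(n):
    """Number of distances needed for n items: n(n - 1)/2."""
    return n * (n - 1) // 2


# Four toy expression profiles measured under three conditions (made-up values).
profiles = [
    [0.0, 1.0, 2.0],
    [0.1, 1.1, 2.1],
    [2.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
]

distances = pairwise_distances(profiles)
print(len(distances))    # 6 pairs for 4 profiles
print(pair_count(5000))  # 12497500 pairs for 5000 profiles
```

Doubling the number of genes roughly quadruples the number of distances, which is why such methods become costly for data sets with thousands of profiles.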
A different problem, which is not often mentioned, derives from the size of the data set: there are obvious limitations in the visual inspection of a hierarchical tree with thousands of branches connecting thousands of items.

Some authors (4,6,7) have proposed the use of neural networks as a convenient alternative to aggregative hierarchical clustering methods. Unsupervised neural networks, and in particular self-organizing maps (SOM) (8), circumvent some of the above-mentioned problems. They can deal with data sets containing noisy, ill-defined items with irrelevant variables and outliers, and the statistical distributions of the data need not be parametric. SOM are robust and reasonably fast and can be easily scaled to large data sets. SOM, in its original form, converts the nonlinear statistical relationships between high-dimensional data into simple geometric relationships of their image points on a low-dimensional map (usually a two-dimensional arrangement of nodes). That is, SOM compresses the information of the original data set while preserving, where possible, its topology (8).

Nevertheless, if the aim is to find the different clusters of co-expressing genes contained in the data set, this approach presents several problems. First, the training of the network depends on the number of items in each cluster. Thus, the clustering obtained by SOM, based on the nodes of the network, will depend more on the sizes of the groups than on the actual differences among their profiles. This is because SOM is not clustering, but producing a reduced representation of the original data set. If irrelevant data (e.g., invariant, “flat” profiles) or some particular type of profile is overrepresented, SOM will produce an output in which genes displaying this particular profile populate the vast majority of nodes. The level of resolution will therefore differ among clusters, depending on the number of representatives each has in the data set.
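The dependence of SOM on group sizes can be explored with a minimal sketch, given here under several assumptions: a 3 x 3 map, a Gaussian neighborhood, linearly decaying learning rate and radius, and synthetic data in which "flat" profiles outnumber two other patterns nine to one. The function names, parameter values, and data are hypothetical and are not those of the original study.

```python
import math
import random
from collections import Counter


def train_som(data, rows, cols, epochs=30, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Initialise every node from a randomly chosen profile.
    nodes = {(r, c): list(rng.choice(data))
             for r in range(rows) for c in range(cols)}
    total = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = data[:]
        rng.shuffle(order)
        for x in order:
            frac = step / total
            lr = lr0 * (1.0 - frac)                   # decaying learning rate
            sigma = max(sigma0 * (1.0 - frac), 0.3)   # shrinking neighborhood
            # Best-matching unit: the node whose vector is closest to x.
            bmu = min(nodes, key=lambda k: math.dist(nodes[k], x))
            for k, w in nodes.items():
                grid_d2 = (k[0] - bmu[0]) ** 2 + (k[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return nodes


def make_profiles(pattern, n, rng, noise=0.1):
    """Generate n noisy copies of a reference pattern (synthetic data)."""
    return [[v + rng.gauss(0.0, noise) for v in pattern] for _ in range(n)]


# Hypothetical reference patterns over four conditions.
patterns = {
    "flat": [0.0, 0.0, 0.0, 0.0],
    "up": [-1.5, -0.5, 0.5, 1.5],
    "down": [1.5, 0.5, -0.5, -1.5],
}
rng = random.Random(1)
data = (make_profiles(patterns["flat"], 180, rng)
        + make_profiles(patterns["up"], 10, rng)
        + make_profiles(patterns["down"], 10, rng))

nodes = train_som(data, rows=3, cols=3)


def closest_pattern(w):
    """Name of the reference pattern nearest to a node vector."""
    return min(patterns, key=lambda name: math.dist(patterns[name], w))


# How many of the 9 nodes represent each pattern after training.
node_labels = Counter(closest_pattern(w) for w in nodes.values())
print(node_labels)
```

Changing the proportions of the three groups and re-running gives a rough impression of how the share of nodes devoted to each pattern follows the group sizes. The exact counts depend on the parameters and the random seed.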
Finally, the lack of a hierarchical structure makes it impossible to detect higher order relationships between clusters of profiles. In an attempt to impose some structure based on the inter-cluster differences, Törönen et al. (6) applied Sammon’s mapping (9) to the resulting map. Nevertheless, the picture provided by such a mapping only highlights that most of the cells in the SOM are very similar, and consequently uninformative for characterizing the different classes of profiles.

In addition, hierarchical clustering with a neural network can be achieved using the self-organizing tree algorithm (SOTA) (7). This method splits the data set into a hierarchy of clusters and sub-clusters by means of a self-organizing process (as in SOM), but using a binary tree topology with a splitting scheme based on inter-cluster distances (10). SOTA starts with a network of two neurons connected through an intermediate “mother” neuron and, after a training procedure similar to that of SOM, splits the data set into two groups. Then, one of the neurons (the one associated with the most heterogeneous part of the data) splits, and training restarts with three terminal nodes connected by a binary tree structure. SOTA allows the growth of the hierarchy to be stopped at a given level of variability, which permits a better visualization of the actual number of different patterns, regardless of the number of representatives of each pattern in the data set.

*To whom correspondence should be addressed. Tel: 34 912246919. Fax: 34 912246972. E-mail: jdopazo@cnio.es.

10.1021/pr025521v CCC: $22.00. 2002 American Chemical Society. Journal of Proteome Research 2002, 1, 467-470. Published on Web 07/11/2002.
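The growth of a SOTA-like binary tree, as described in the text, can be sketched in a deliberately simplified form. This is not the published SOTA procedure, and several choices are assumptions: only the winning daughter cell is updated (the mother and sister updates of SOTA are omitted), the two daughters are trained only on the profiles of the cell being split, and heterogeneity is measured by the mean distance between a cell vector and its assigned profiles. The names `Cell`, `resource`, `split`, and `grow`, together with the threshold and the data, are hypothetical.

```python
import math
import random


class Cell:
    """A node of the binary tree; the leaves are the current clusters."""

    def __init__(self, vector, members):
        self.vector = list(vector)
        self.members = members   # profiles currently assigned to this cell
        self.children = []       # stays empty while the cell is a leaf


def resource(cell):
    """Mean distance between a cell's vector and its assigned profiles."""
    if not cell.members:
        return 0.0
    total = sum(math.dist(cell.vector, x) for x in cell.members)
    return total / len(cell.members)


def split(cell, epochs=20, eta0=0.5):
    """Turn a leaf into a mother cell with two daughters trained on its profiles."""
    daughters = [Cell(cell.vector, []), Cell(cell.vector, [])]
    for epoch in range(epochs):
        eta = eta0 * (1.0 - epoch / epochs)   # decaying learning rate
        for x in cell.members:
            winner = min(daughters, key=lambda d: math.dist(d.vector, x))
            winner.vector = [w + eta * (xi - w)
                             for w, xi in zip(winner.vector, x)]
    # Final assignment of each profile to its closest daughter.
    for x in cell.members:
        winner = min(daughters, key=lambda d: math.dist(d.vector, x))
        winner.members.append(x)
    cell.children = daughters
    return daughters


def grow(data, threshold, max_leaves=16):
    """Split the most heterogeneous leaf until all are below the threshold."""
    dim = len(data[0])
    mean = [sum(x[i] for x in data) / len(data) for i in range(dim)]
    root = Cell(mean, list(data))
    leaves = [root]
    while len(leaves) < max_leaves:   # hard cap guarantees termination
        worst = max(leaves, key=resource)
        if resource(worst) <= threshold:
            break
        leaves.remove(worst)
        leaves.extend(split(worst))
    return root, leaves


# Synthetic data: one large group and two small ones (made-up values).
rng = random.Random(0)


def noisy(pattern, n, noise=0.1):
    return [[v + rng.gauss(0.0, noise) for v in pattern] for _ in range(n)]


data = (noisy([0.0, 0.0, 0.0, 0.0], 60)
        + noisy([-1.5, -0.5, 0.5, 1.5], 6)
        + noisy([1.5, 0.5, -0.5, -1.5], 6))
rng.shuffle(data)

root, leaves = grow(data, threshold=0.3)
for leaf in leaves:
    print(len(leaf.members), [round(v, 2) for v in leaf.vector])
```

Because growth stops according to the variability within each leaf rather than the number of profiles it holds, the stopping rule itself does not favor large groups. The threshold plays the role of the "given level of variability" mentioned in the text and would need to be chosen for real data.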