ANALYZING GENE EXPRESSION PROFILES WITH ICA

D. Lutter, K. Stadlthanner, F. Theis, E. W. Lang
Institute of Biophysics
University of Regensburg
93040 Regensburg, Germany
email: elmar.lang@biologie.uni-regensburg.de

A. M. Tomé
DET/IEETA
University of Aveiro
3810-193 Aveiro
email: ana@ieeta.pt

B. Becker, Th. Vogt
Clinic of Dermatology
University Hospital Regensburg
93053 Regensburg, Germany

ABSTRACT

High-throughput genome-wide measurements of gene transcript levels have become available with the recent development of microarray technology. Intelligent and efficient mathematical and computational analysis tools are needed to read and interpret the information content buried in those large scale gene expression patterns at various levels of resolution. But the development of such methods is still in its infancy. Modern machine learning and data mining techniques based on information theory, like independent component analysis (ICA), consider gene expression patterns as a superposition of independent expression modes which are considered putative independent biological processes. We focus on two widely used ICA algorithms to blindly decompose gene expression profiles into independent component profiles representing underlying biological processes. These exploratory methods will be capable of detecting similarity, locally or globally, in gene expression patterns and help to group genes into functional categories, for example genes that are expressed to a greater or lesser extent in response to a drug or an existing disease.

KEY WORDS

Independent component analysis, microarrays, gene expression profiles, FastICA, JADE

1 Introduction

Environmental stimuli or stimuli representing the internal state of cells induce or repress genes via up- or down-regulating the amounts of corresponding mRNA molecules. Different experimental conditions show different characteristic expression patterns.
Gene expression is controlled by a combination of mechanisms including networks of signaling substances, transcription factors and their binding sites in the promoter regions of genes, as well as modifications of the chromatin structure and different types of post-transcriptional regulation. The expression of each gene thus relies on the specific processing of a number of regulatory inputs.

High-throughput genome-wide measurements of gene transcript levels have become available with the recent development of microarray technology [1]. Intelligent and efficient mathematical and computational analysis tools are needed to read and interpret the information content buried in those large data sets (for recent reviews see [2, 3]).

Traditionally, two strategies exist to analyze such data sets. If prior knowledge about the samples is available, a supervised analysis can identify gene expression patterns, called features, that are specific to a given class, and can also classify new samples. Without any prior hypothesis, unsupervised approaches can discover novel biological mechanisms and reveal genetic regulatory networks in large data sets. Such unsupervised methods for microarray data analysis can be divided into clustering approaches, model-based approaches and projection methods.

Clustering approaches group genes with similar behavior under similar experimental conditions, making it possible to analyze the data within each group separately. Genes within a cluster are supposed to be functionally related. In general, no attempt is made to model the underlying biology. A drawback of this method is that clusters are generally disjoint, whereas genes may be part of several biological processes.

Model-based approaches try to explain the interactions among the biological entities participating in gene regulatory networks. The parameters of the model are trained on expression data sets. With complex models, not enough data may be available to estimate the parameters.
Also, the algorithms are often of prohibitive complexity and computational load.

Projection methods expand the data in a basis with a desired property. Two projection methods are mainly used:

Principal component analysis (PCA) projects the data onto mutually orthogonal principal components [4]. Each principal component captures the maximum information, i.e. variance, that is not already present in the previous components. PCA is the optimal dimension-reduction technique in the sum-of-squared-errors sense and hence can be used for data compression as well. Dimension reduction of expression data is usually applied for visualization, for filtering noise, or for reducing the computational load of subsequent computations. With microarray data the principal components are called eigenarrays.

ICA decomposes the data into statistically independent components (ICs) [5]. Usually a linear superposition of the underlying unknown source signals is assumed, but nonlinear ICA algorithms also exist. Unsupervised approaches like ICA represent a versatile tool for microarray analysis, where they extract expression modes, the ICs. Each retrieved IC is considered a putative biological process, which can be characterized by the functional annotations of genes that are predominant within