147. Gaussian Mixture Density Estimation applied to Microarray Data

Christine Steinhoff 1 , Tobias Müller 1 , Ulrike Nuber 2 , Martin Vingron 1

Keywords: Gaussian Mixture Model, EM algorithm, microarray

1 Introduction

Microarray experiments can be used to determine gene expression profiles of specific cell types. In this way, the genes of a given cell type might be categorized as active or inactive. Typically, one would like to infer for each gene a probability that it is expressed in a given tissue or experiment. This has mainly been addressed by selecting an arbitrary threshold and defining spot intensities above this threshold as “high signal” and those below it as “low signal”. However, this approach does not yield a probabilistic measure. As a probabilistic model, we propose to fit a mixture of normal distributions to the data, which provides not only an overall description of the data but also a probabilistic framework.

The literature describes many methods in which various kinds of distributions are fitted to ratios of microarray data (Ghosh and Chinnaiyan, 2002; Li and Yang, 2002). Fitting distributions to the entire dataset of a single sample, rather than to the ratios of two expression profiles, has not been studied well. Hoyle et al. (2002) propose to fit a log-normal distribution with a tail that is close to a power law. To our knowledge, this is the only publication in which single-sample intensities are approximated. The use of one specific parameterized distribution, however, poses problems when the overall intensities occur in various shapes, especially when they show non-uniformly shaped tails. These effects can occur for different reasons and cannot always be removed by normalization. Hoyle et al. (2002) used only the most highly expressed genes for the fitting procedure. This reduces the probability of observing a mixture of several densities, but it restricts the fit to the most highly expressed genes and provides no overall model.
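The proposal to fit a mixture of normal distributions with the EM algorithm can be illustrated with a minimal one-dimensional sketch. It is not the authors' implementation: the function names `posteriors` and `fit_gmm_1d`, the default of two components, the quantile-based initialisation, the variance floor and the stopping rule are all assumptions made for illustration, and the input is assumed to be log-transformed spot intensities.

```python
import numpy as np


def posteriors(x, weights, means, variances):
    """Return (responsibilities, log-likelihood) for 1-D data x.

    responsibilities[i, j] is the posterior probability that x[i]
    was generated by mixture component j.
    """
    x = np.asarray(x, dtype=float)
    # log of weight_j * Normal(x_i | mean_j, variance_j), shape (n, k)
    log_joint = (
        np.log(weights)
        - 0.5 * np.log(2.0 * np.pi * variances)
        - 0.5 * (x[:, None] - means) ** 2 / variances
    )
    # log-sum-exp over components gives the log mixture density per point
    log_mix = np.logaddexp.reduce(log_joint, axis=1)
    return np.exp(log_joint - log_mix[:, None]), log_mix.sum()


def fit_gmm_1d(x, k=2, n_iter=500, tol=1e-8):
    """Fit a k-component 1-D Gaussian mixture by EM (illustrative only)."""
    x = np.asarray(x, dtype=float)
    # Assumed initialisation: means at evenly spaced quantiles,
    # a shared variance and equal mixing weights.
    means = np.quantile(x, (np.arange(k) + 0.5) / k)
    variances = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior component memberships under current parameters
        resp, ll = posteriors(x, weights, means, variances)
        if ll - prev_ll < tol:  # stop once the log-likelihood stalls
            break
        prev_ll = ll
        # M-step: responsibility-weighted updates of all parameters
        nk = resp.sum(axis=0) + 1e-12  # guard against empty components
        weights = nk / nk.sum()
        means = resp.T @ x / nk
        sq_dev = (x[:, None] - means) ** 2
        # Assumed variance floor to avoid degenerate components
        variances = np.maximum((resp * sq_dev).sum(axis=0) / nk, 1e-6)
    return weights, means, variances
```

After fitting, `posteriors(x, weights, means, variances)[0]` gives, for each spot, its probability of belonging to each component. If the component with the largest mean is read as the "high signal" population, that column plays the role of the probabilistic measure that a fixed intensity threshold cannot provide. Whether such a reading is appropriate depends on how many components the data actually support.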
2 Results

Here, we present examples of microarray data that are unlikely to be properly fitted by a single log-normal distribution as described by Hoyle et al. (2002). One example is displayed in Figure 1. Rather, the data appear to be drawn from a mixture of different distributions, which can produce multi-modal shapes of various kinds. Different reasons may account for this:

(1) On relatively small microarrays, genes may be over-represented in a specific intensity range, or the data may be truncated.
(2) Saturation effects could be another reason.
(3) Specific effects that cannot be localized and removed by normalization might lead to varying shapes.
(4) If a microarray design is based on different oligo selection procedures, the resulting intensity distribution could show more than one mode.

Thus, we do not assume that a single unimodal distribution can explain microarray experiments in general.

1 Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestraße 63-73, D-14195 Berlin, Germany. E-mail: {christine.steinhoff|tobias.mueller|martin.vingron}@molgen.mpg.de
2 Human Molecular Genetics, Max Planck Institute for Molecular Genetics, Ihnestraße 63-73, D-14195 Berlin, Germany. E-mail: nuber@molgen.mpg.de