147. Gaussian Mixture Density Estimation applied to Microarray Data

Christine Steinhoff¹, Tobias Müller¹, Ulrike Nuber², Martin Vingron¹

Keywords: Gaussian Mixture Model, EM algorithm, microarray

1 Introduction

Microarray experiments can be used to determine gene expression profiles of specific cell types. In this way, the genes of a given cell type can be categorized as active or inactive. Typically, one would like to infer, for each gene, the probability that it is expressed in a given tissue or experiment. This has mainly been addressed by selecting an arbitrary threshold and defining spot intensities above this threshold as “high signal” and those below it as “low signal”. However, this approach does not yield a probabilistic measure. As a probabilistic model, we propose to fit a mixture of normal distributions to the data, which provides not only an overall description of the data but also a probabilistic framework.

Many methods in the literature fit various kinds of distributions to ratios of microarray data (Ghosh and Chinnaiyan (2002), Li and Yang (2002)). Fitting distributions to the entire dataset of a single sample, rather than to the ratios of two experiment profiles, has not been studied very well. Hoyle et al. (2002) propose to fit a log-normal distribution with a tail that is close to a power law. To our knowledge, this is the only publication in which single-sample intensities are approximated. The use of one specific parameterized distribution, however, poses problems when the overall intensities occur in various shapes and, especially, show non-uniformly shaped tails. These effects can occur for different reasons and cannot be captured by normalization in all cases. Hoyle et al. (2002) used only the most highly expressed genes for the fitting procedure. This reduces the probability of observing a mixture of several densities, but it focuses only on the most highly expressed genes and provides no overall model.
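As an illustration of the proposed idea, the following minimal sketch fits a one-dimensional mixture of normal distributions with the EM algorithm and then reads off, for a given intensity, the posterior probability of belonging to the high-intensity component. It is not the authors' implementation: the number of components (two), the quantile-based initialisation, the simulated log-intensities and all function names (`fit_gmm_1d`, `posterior`, `normal_pdf`) are assumptions made only for this example.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(data, k=2, n_iter=100, var_floor=1e-6):
    """EM for a k-component 1-D Gaussian mixture.

    Returns (weights, means, variances). Initialisation (an assumption of
    this sketch): means at evenly spaced quantiles, common overall variance.
    """
    n = len(data)
    srt = sorted(data)
    means = [srt[int((j + 0.5) * n / k)] for j in range(k)]
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [overall_var] * k
    weights = [1.0 / k] * k

    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens) or 1e-300  # guard against numerical underflow
            resp.append([d / total for d in dens])

        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj <= 0.0:
                continue  # leave an empty component unchanged
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, var_floor)

    return weights, means, variances


def posterior(x, weights, means, variances):
    """Posterior probability of each component given the observation x."""
    dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(dens) or 1e-300
    return [d / total for d in dens]


# Simulated log-intensities (hypothetical data, not from the paper):
# a "low signal" group around 6 and a "high signal" group around 10.
rng = random.Random(0)
log_intensities = [rng.gauss(6.0, 0.5) for _ in range(600)]
log_intensities += [rng.gauss(10.0, 1.0) for _ in range(400)]

weights, means, variances = fit_gmm_1d(log_intensities, k=2)

# Index of the component with the larger mean ("high signal").
high = max(range(len(means)), key=lambda j: means[j])
p_high_at_11 = posterior(11.0, weights, means, variances)[high]
p_high_at_5 = posterior(5.0, weights, means, variances)[high]
print("weights:", weights)
print("means:", means)
print("P(high | x=11) =", p_high_at_11)
print("P(high | x=5)  =", p_high_at_5)
```

In contrast to a fixed threshold, the posterior returned here changes smoothly with the intensity, so each spot receives a probability rather than a binary label. In practice the number of components would have to be chosen for the data at hand, which this sketch does not address.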
2 Results

Here, we present examples of microarray data that are unlikely to be fitted properly by a single log-normal distribution as described by Hoyle et al. (2002). One example is displayed in Figure 1. The data rather appear to be a mixture of different distributions, which can lead to multi-modal shapes of various kinds. Different reasons may account for this:

(1) On relatively small microarrays, genes may be over-represented in a specific intensity range, or some form of data truncation may occur.
(2) Saturation effects could be another reason.
(3) Specific effects that cannot be localized and captured by normalization might lead to varying shapes.
(4) If a microarray design is based on different oligo selection procedures, the resulting intensity distribution could show more than one mode.

Thus, we do not assume that there is a single unimodal overall distribution that can explain microarray experiments in general.

¹ Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestraße 63-73, D-14195 Berlin, Germany. E-mail: {christine.steinhoff|tobias.mueller|martin.vingron}@molgen.mpg.de
² Human Molecular Genetics, Max Planck Institute for Molecular Genetics, Ihnestraße 63-73, D-14195 Berlin, Germany. E-mail: nuber@molgen.mpg.de