Robust Estimation of Gaussian Mixtures from Noisy Input Data

Shaobo Hou    Aphrodite Galata
School of Computer Science, The University of Manchester

Abstract

We propose a variational Bayes approach to the problem of robust estimation of Gaussian mixtures from noisy input data. The proposed algorithm explicitly takes into account the uncertainty associated with each data point, makes no assumptions about the structure of the covariance matrices and is able to automatically determine the number of Gaussian mixture components. Through the use of both synthetic and real-world data examples, we show that by incorporating uncertainty information into the clustering algorithm, we obtain better results at recovering the true distribution of the training data than other variational Bayesian clustering algorithms.

1. Introduction

Standard EM-based clustering algorithms [7] assume that all input data points are equally important; the effect of noise or measurement errors is often ignored and not explicitly modelled during model estimation. However, this is not always a valid assumption, since the input data can be corrupted by measurement errors. Uncertainties can also be introduced by additional transformations of the data, such as dimensionality reduction. In particular, in the case of non-linear transformations, it is no longer safe to assume that the uncertainties are uniform across the data. If the level of uncertainty can be quantified, it makes sense to incorporate it into the clustering algorithm to improve the estimation of the true data distribution.

In this paper we propose a novel algorithm for learning a mixture of Gaussians that takes into account the uncertainties of the input data. In our formulation we assume that the uncertainty on a data point can be modelled by a multivariate Gaussian distribution and is independent of the uncertainties on other data points.
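One way to make this formulation concrete (the notation here is ours, not necessarily the paper's): if the true position $x_n$ underlying observation $y_n$ is corrupted by zero-mean Gaussian noise with covariance $S_n$, then marginalising $x_n$ out of a $K$-component mixture gives

$$
p(y_n \mid \theta) \;=\; \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(y_n \;\middle|\; \mu_k,\; \Sigma_k + S_n\right),
$$

so each point is effectively scored against a component covariance inflated by that point's own uncertainty.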
Intuitively, this allows a data point with large uncertainty to exert less influence on the mixture components than a data point with small uncertainty. The optimal mixture model representing the data with uncertainties is found using a variational Bayesian algorithm that automatically chooses the appropriate number of components in the mixture model. We show that by taking uncertainty information into account, our algorithm performs better at estimating the correct number of clusters and at recovering the true distribution of the training data than other variational Bayesian clustering algorithms [6, 3]. The proposed algorithm is evaluated on a number of synthetic and real data sets and is shown to improve the results of various pattern recognition tasks such as motion segmentation and the partitioning of microarray gene expressions.

2. Related Work

Researchers have previously looked into the problem of incorporating uncertainty information into model fitting and have developed algorithms such as Total Least Squares [1], which assumes the noise on all data points is drawn from the same uncertainty distribution, and the Fundamental Numerical Scheme [5], which allows different data points to be associated with different uncertainty distributions. The problem has also been studied in the context of support vector machine classification [2].

Various researchers have also investigated the problem of unsupervised clustering of data with uncertainty. Chaudhuri and Bhowmik [4] proposed a modified K-means algorithm which assumes uniform uncertainties, such that the true position of a data point can be anywhere within a hypersphere centred on its observed position. Kumar and Patel [11] also proposed generalisations of K-means and hierarchical clustering to handle zero-mean Gaussian measurement uncertainty. However, their formulation is based on intuition alone and is not probabilistically well principled.
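The mechanism by which an uncertain point exerts less pull on the components can be illustrated with a toy maximum-likelihood EM analogue (a simplified sketch of the idea only, not the paper's variational Bayes algorithm: there are no priors and the number of components is fixed). Each observation `X[n]` carries its own noise covariance `S[n]`, and the E-step scores it against the component covariance inflated by `S[n]`; the function name and the naive M-step are our own assumptions for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_noisy_gmm(X, S, K, n_iter=50, seed=0):
    """Toy EM for a K-component Gaussian mixture where each observation
    X[n] has its own Gaussian noise covariance S[n].  In the E-step a
    point is scored against N(mu_k, Sigma_k + S[n]), so a point with
    large S[n] spreads its responsibility more evenly across components
    and pulls less sharply on any single one of them."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, K, replace=False)]   # initialise means from data
    Sigma = np.stack([np.eye(D)] * K)         # initial component covariances
    pi = np.full(K, 1.0 / K)                  # mixing weights
    for _ in range(n_iter):
        # E-step: responsibilities with per-point inflated covariances
        R = np.zeros((N, K))
        for n in range(N):
            for k in range(K):
                R[n, k] = pi[k] * multivariate_normal.pdf(
                    X[n], mu[k], Sigma[k] + S[n])
        R /= R.sum(axis=1, keepdims=True)
        # M-step (naive approximation: treats X[n] as if noise-free;
        # the paper instead handles uncertainty in a Bayesian manner)
        Nk = R.sum(axis=0)
        pi = Nk / N
        mu = (R.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (R[:, k, None, None] *
                        np.einsum('ni,nj->nij', diff, diff)).sum(0) / Nk[k]
            Sigma[k] += 1e-6 * np.eye(D)      # jitter for numerical stability
    return pi, mu, Sigma
```

The full algorithm in the paper additionally places priors on the mixture parameters and prunes superfluous components automatically, which this fixed-K sketch does not attempt.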
Handman and Govaert [8] proposed an EM [7] clustering algorithm which modelled non-identically distributed uncertainty as rectangular error zones. More recently, in bioinformatics, in order to group genes with similar expression patterns from microarray experiments, Liu et al. [12] proposed a probabilistic clustering algorithm for estimating the maximum likelihood mixture of Gaussians with spherical covariance matrices from data with zero-mean Gaussian measurement errors represented

978-1-4244-2243-2/08/$25.00 ©2008 IEEE