Robust Estimation of Gaussian Mixtures from Noisy Input Data
Shaobo Hou Aphrodite Galata
School of Computer Science
The University of Manchester
Abstract
We propose a variational Bayes approach to the problem
of robustly estimating Gaussian mixtures from noisy input
data. The proposed algorithm explicitly takes into account
the uncertainty associated with each data point, makes no
assumptions about the structure of the covariance matrices,
and is able to automatically determine the number of
Gaussian mixture components. Through both synthetic
and real-world data examples, we show that by incorporating
uncertainty information into the clustering algorithm,
we obtain better results at recovering the true distribution
of the training data than other variational Bayesian
clustering algorithms.
1. Introduction
Standard EM-based clustering algorithms [7] assume
that all input data points are equally important; the effect
of noise or measurement error is often ignored and not
explicitly modelled during model estimation. However, this
is not always a valid assumption, since the input data can be
corrupted by measurement errors. Uncertainties can also be
introduced by additional transformations of the data, such as
dimensionality reduction. In particular, in the case of non-linear
transformations, it is no longer safe to assume that the
uncertainties are uniform across the data. If the level of
uncertainty can be quantified, it makes sense to incorporate
it into the clustering algorithm to improve the estimation
of the true data distribution.
In this paper we propose a novel algorithm for learning
a mixture of Gaussians that takes into account the uncertainties
of the input data. In our formulation we assume that the
uncertainty on a data point can be modelled by a multivariate
Gaussian distribution and is independent of that on the other
data points. Intuitively, this allows a data point with large
uncertainty to exert less influence on the mixture components
than a data point with smaller uncertainty. The optimal
mixture model that represents the data with uncertainties
is found using a variational Bayesian algorithm that
automatically chooses the appropriate number of components
in the mixture model. We show that by taking the
uncertainty information into account, our algorithm performs
better at estimating the correct number of clusters and
recovering the true distribution of the training data than
other variational Bayesian clustering algorithms [6, 3]. The
proposed algorithm is evaluated on a number of synthetic
and real data sets and is shown to improve the results of
various pattern recognition tasks such as motion segmentation
and the partitioning of microarray gene expressions.
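To make the down-weighting intuition concrete, consider a simpler maximum-likelihood analogue of the idea (a sketch of the principle, not the paper's variational updates): if an observed point x_n carries a Gaussian noise covariance S_n, its likelihood under component k marginalises the noise to N(x_n | mu_k, Sigma_k + S_n), so points with larger S_n yield flatter likelihoods and softer responsibilities. The function name and interface below are our own illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def noise_aware_responsibilities(X, S, weights, means, covs):
    """E-step responsibilities for a GMM where each point X[n] has its own
    noise covariance S[n]. Component likelihoods use the noise-inflated
    covariance covs[k] + S[n], so highly uncertain points are assigned
    more diffusely across components. Illustrative sketch only."""
    N, K = X.shape[0], len(weights)
    logp = np.empty((N, K))
    for n in range(N):
        for k in range(K):
            logp[n, k] = np.log(weights[k]) + multivariate_normal.logpdf(
                X[n], mean=means[k], cov=covs[k] + S[n])
    logp -= logp.max(axis=1, keepdims=True)  # log-sum-exp stabilisation
    r = np.exp(logp)
    return r / r.sum(axis=1, keepdims=True)
```

In the paper's variational setting these responsibilities would instead arise from the variational posterior over component assignments, but the role of the inflated covariance Sigma_k + S_n is the same.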
2. Related Work
Previously, researchers have looked into the problem
of incorporating uncertainty information into model fitting
and have developed algorithms such as Total Least Squares [1],
which assumes that the noise on all data points is drawn
from the same uncertainty distribution, and the Fundamental
Numerical Scheme [5], which allows different data points
to be associated with different uncertainty distributions. The
problem has also been studied in the context of support
vector machine classification [2].
Various researchers have also investigated the problem of
unsupervised clustering of data with uncertainty. Chaudhuri
and Bhowmik [4] proposed a modified K-means algorithm
which assumes uniform uncertainties, such that the true position
of a data point can be anywhere within a hypersphere
centred on its observed position. Kumar and Patel [11] also
proposed generalisations of K-means and hierarchical clustering
to handle zero-mean Gaussian measurement uncertainty.
However, their formulation is based on intuition and
is not probabilistically well principled. Handman
and Govaert [8] proposed an EM [7] clustering algorithm
which modelled non-identically distributed uncertainty as
rectangular error zones.
More recently, in bioinformatics, in order to group genes
with similar expression patterns from microarray experiments,
Liu et al. [12] proposed a probabilistic clustering
algorithm for estimating the maximum-likelihood mixture
of Gaussians with spherical covariance matrices from data
with zero-mean Gaussian measurement errors represented
978-1-4244-2243-2/08/$25.00 ©2008 IEEE