MODEL CENTROIDS FOR THE SIMPLIFICATION OF KERNEL DENSITY ESTIMATORS

Olivier Schwander ⋆, Frank Nielsen ⋆†
⋆ École Polytechnique, Palaiseau, France
† Sony Computer Science Laboratories Inc., Tokyo, Japan

Version updated on 04/24/2012

ABSTRACT

Gaussian mixture models are a widespread tool for modeling varied and complex probability density functions. They can be estimated using Expectation–Maximization or Kernel Density Estimation. Expectation–Maximization leads to compact models but may be expensive to compute, whereas Kernel Density Estimation yields large models which are cheap to build. In this paper we present new methods to obtain high-quality models that are both compact and fast to compute. This is accomplished with clustering methods and centroid computations. The quality of the resulting mixtures is evaluated in terms of log-likelihood and Kullback–Leibler divergence, using examples from a bioinformatics application.

Index Terms— Kernel Density Estimation, simplification, Expectation–Maximization, k-means, Fisher–Rao centroid

1. INTRODUCTION

Statistical methods are nowadays commonplace in modern signal processing. There are basically two major approaches for modeling experimental data by probability distributions: we may either (1) consider a semi-parametric modeling by a finite mixture model learnt with the Expectation–Maximization (EM) procedure, or alternatively (2) choose a non-parametric modeling using a kernel density estimator (KDE). On the one hand, mixture modeling requires fixing or learning the number of components but provides a useful compact representation of the data. On the other hand, KDE finely describes the underlying empirical distribution at the expense of a dense model size. In this paper, we present a novel statistical modeling method that efficiently simplifies a KDE model with respect to an underlying distance between Gaussian kernels. We consider the Fisher–Rao metric or the Kullback–Leibler divergence.
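The Kullback–Leibler divergence between two univariate Gaussians admits a well-known closed form, which makes it a convenient distance for comparing kernels. As the paper does not give code, the following is a minimal sketch (the function name is ours) of that closed form:

```python
import math

def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """Closed-form KL divergence KL(N(mu1, sigma1^2) || N(mu2, sigma2^2))
    between two univariate normal distributions (standard deviations given)."""
    return (math.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2.0 * sigma2**2)
            - 0.5)
```

Note that, unlike the Fisher–Rao metric, this divergence is asymmetric: in general `kl_gaussian(m1, s1, m2, s2) != kl_gaussian(m2, s2, m1, s1)`.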
Since the underlying Fisher–Rao geometry of Gaussians is hyperbolic, without a closed-form equation for the centroids, we instead adopt a close approximation that bears the name of hyperbolic model centroid, and show its use in a single-step clustering method. We report on experiments showing that the KDE simplification paradigm is a competitive approach to classical EM, both in terms of processing time and quality.

2. MODELING DISTRIBUTIONS

Mixtures of Gaussian distributions are a widespread tool for modeling complex data in various domains, from image processing to medical data analysis through speech recognition. This success is due to the capacity of Gaussian Mixture Models (GMM) to estimate the probability density function (pdf) of complex random variables. For a mixture f of n components, the probability density function takes the form:

f(x) = \sum_{i=1}^{n} \omega_i \, g(x; \mu_i, \sigma_i^2) \qquad (1)

where \omega_i denotes the weight of component i (with \sum_{i=1}^{n} \omega_i = 1). Each component g(x; \mu_i, \sigma_i^2) is a normal distribution with the pdf:

g(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) \qquad (2)

Such a mixture can be built using the celebrated Expectation–Maximization (EM) algorithm, which iteratively estimates the parameters that maximize the likelihood.
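Equations (1) and (2) can be evaluated directly; the sketch below (function names are ours, not from the paper) shows a plain Python implementation of the mixture density:

```python
import math

def gaussian_pdf(x, mu, sigma2):
    # Eq. (2): univariate normal density with mean mu and variance sigma2
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def mixture_pdf(x, weights, mus, sigma2s):
    # Eq. (1): weighted sum of Gaussian components, with sum(weights) == 1
    return sum(w * gaussian_pdf(x, m, s2)
               for w, m, s2 in zip(weights, mus, sigma2s))
```

In this view, a KDE is simply the special case of (1) with n equal to the number of samples, uniform weights 1/n, means placed at the sample points, and a common bandwidth as the variance; this is what makes KDE models cheap to build but dense.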