Unsupervised Greedy Learning of Finite Mixture Models

Nicola Greggio §†, Alexandre Bernardino, Cecilia Laschi §, José Santos-Victor and Paolo Dario §‡
§ ARTS Lab - Scuola Superiore S.Anna, Polo S.Anna Valdera, Viale R. Piaggio, 34 - 56025 Pontedera, Italy
CRIM Lab - Scuola Superiore S.Anna, Polo S.Anna Valdera, Viale R. Piaggio, 34 - 56025 Pontedera, Italy
Instituto de Sistemas e Robótica, Instituto Superior Técnico, 1049-001 Lisboa, Portugal
Email: n.greggio@isr.ist.utl.pt

Abstract—This work deals with a new technique for the estimation of the parameters and number of components in a finite mixture model. The learning procedure is performed by means of an expectation maximization (EM) methodology. The key feature of our approach is a top-down hierarchical search for the number of components, together with the integration of the model selection criterion within a modified EM procedure used for learning the mixture parameters. We start with a single component covering the whole data set. Then new components are added and optimized to best cover the data. The process is recursive and builds a binary tree-like structure that effectively explores the search space. We show that our approach is faster than state-of-the-art alternatives, is insensitive to initialization, and achieves better data fits on average. We elucidate this through a series of experiments, both with synthetic and real data.

Keywords-Machine Learning, Unsupervised Clustering, Self-Adapting Expectation Maximization, Image Processing

I. INTRODUCTION

Unsupervised clustering classifies data into classes based on redundancies contained within the data sample. Fitting a mixture model to the distribution of the data is equivalent, in some applications, to the identification of the clusters with the mixture components [1].
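To make the mixture-fitting idea concrete, the following is a minimal NumPy sketch (not the authors' algorithm) of plain EM for a one-dimensional Gaussian mixture; all names and the iteration budget are illustrative choices.

```python
import numpy as np

def em_gmm(x, k=2, iters=50, seed=0):
    """Fit a k-component 1-D Gaussian mixture to data x with plain EM."""
    rng = np.random.default_rng(seed)
    n = len(x)
    w = np.full(k, 1.0 / k)               # mixing weights
    mu = rng.choice(x, k, replace=False)  # initialize means at data points
    var = np.full(k, np.var(x))           # start with the overall variance
    for _ in range(iters):
        # E-step: responsibilities r[i, j] = P(component j | x_i)
        d = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = w * d
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances from responsibilities
        nk = r.sum(axis=0)
        w = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])
w, mu, var = em_gmm(x)
```

Note that the result depends on the random initialization of the means, which is precisely the sensitivity the greedy approaches discussed below try to avoid.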
The normal, or Gaussian, distribution is one of the most widely used throughout statistics, natural science, and social science as a simple model for complex phenomena. The selection of the right number of components is a critical issue. The more components there are within the mixture, the better the data fit will be. However, this leads to data overfitting and to an increased computational burden as well.

The selection of the best number of components in a mixture distribution can be performed mainly in two ways: off-line and on-line. On one hand, the off-line procedures evaluate the best model by executing independent runs of the EM algorithm for many different initializations, and evaluating each estimate with criteria that penalize complex models (e.g. the Akaike Information Criterion (AIC) [2], the Schwarz Bayesian Information Criterion [3], the Rissanen Minimum Description Length (MDL) [4], and the Wallace and Freeman Minimum Message Length (MML) [5]). All of these criteria, in order to be effective, have to be evaluated for every possible number of components under comparison. Therefore, to cover a sufficient search range, the computational cost grows with both the number of tested models and the number of model parameters. On the other hand, the on-line procedures start with a fixed set of models and sequentially adjust their configuration (including the number of components) based on different evaluation criteria.

Pernkopf and Bouchaffra proposed a genetic-based EM algorithm capable of learning Gaussian mixture models [6]. They first selected the number of components by means of the minimum description length (MDL) criterion, and then explored a combination of genetic algorithms with EM. Among the on-line techniques it is possible to distinguish three different categories: those that only increment the number of components, those that both increment and reduce them (split-and-merge techniques), and those that only reduce them.
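The off-line procedure described above can be sketched as follows, using scikit-learn's `GaussianMixture` for brevity and BIC as the penalized criterion; the candidate range and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data drawn from a true two-component mixture
x = np.concatenate([rng.normal(-4, 1, 300), rng.normal(3, 1, 300)]).reshape(-1, 1)

# Off-line selection: fit a separate mixture for each candidate number of
# components and keep the one minimizing BIC = -2 log L + p log n.
best_k, best_bic = None, np.inf
for k in range(1, 6):
    gm = GaussianMixture(n_components=k, n_init=3, random_state=0).fit(x)
    bic = gm.bic(x)  # penalizes models with more free parameters
    if bic < best_bic:
        best_k, best_bic = k, bic

print(best_k)
```

The loop makes the cost structure explicit: every candidate model order requires its own full EM run (here, several, via `n_init`), which is why the complexity grows with the size of the search range.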
A greedy algorithm is characterized by making the locally optimal choice at each stage, with the hope of finding the global optimum. Applied to EM, greedy methods usually start with a single component (thereby sidestepping the EM initialization problem), and then increase the number of components during the computation. However, the key issue in this kind of algorithm is the insertion criterion: deciding when and how to insert a new component can determine the success or failure of the subsequent computation.

In 2002 Vlassis and Likas introduced a greedy algorithm for learning Gaussian mixtures [7]. They start with a single component covering all the data, then split a component and perform EM locally, optimizing only the two modified components. Nevertheless, the total complexity of the global search for the component to be split is O(n²). Subsequently, Verbeek et al. developed a greedy method to learn the Gaussian mixture model configuration [8]. Their search for the new components is claimed to take O(n). Greedy algorithms mostly (but not always) fail to find