Unsupervised Greedy Learning of Finite Mixture Models

Nicola Greggio §†, Alexandre Bernardino, Cecilia Laschi §, José Santos-Victor and Paolo Dario §‡
§ ARTS Lab - Scuola Superiore S.Anna, Polo S.Anna Valdera, Viale R. Piaggio, 34 - 56025 Pontedera, Italy
CRIM Lab - Scuola Superiore S.Anna, Polo S.Anna Valdera, Viale R. Piaggio, 34 - 56025 Pontedera, Italy
Instituto de Sistemas e Robótica, Instituto Superior Técnico, 1049-001 Lisboa, Portugal
Email: n.greggio@isr.ist.utl.pt

Abstract—This work deals with a new technique for the estimation of the parameters and number of components in a finite mixture model. The learning procedure is performed by means of an expectation maximization (EM) methodology. The key feature of our approach is a top-down hierarchical search for the number of components, together with the integration of the model selection criterion within a modified EM procedure used for learning the mixture parameters. We start with a single component covering the whole data set. Then new components are added and optimized to best cover the data. The process is recursive and builds a binary tree-like structure that effectively explores the search space. We show that our approach is faster than state-of-the-art alternatives, is insensitive to initialization, and achieves better data fits on average. We elucidate this through a series of experiments, both with synthetic and real data.

Keywords-Machine Learning, Unsupervised Clustering, Self-Adapting Expectation Maximization, Image Processing

I. INTRODUCTION

Unsupervised clustering classifies data into classes based on redundancies contained within the data sample. Fitting a mixture model to the distribution of the data is equivalent, in some applications, to the identification of the clusters with the mixture components [1].
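To make the mixture-fitting idea concrete, the following is a minimal NumPy sketch (not the authors' algorithm) of plain EM for a one-dimensional Gaussian mixture; all names and the iteration budget are illustrative choices.

```python
import numpy as np

def em_gmm(x, k=2, iters=50, seed=0):
    """Fit a k-component 1-D Gaussian mixture to data x with plain EM."""
    rng = np.random.default_rng(seed)
    n = len(x)
    w = np.full(k, 1.0 / k)               # mixing weights
    mu = rng.choice(x, k, replace=False)  # initialize means at data points
    var = np.full(k, np.var(x))           # start with the overall variance
    for _ in range(iters):
        # E-step: responsibilities r[i, j] = P(component j | x_i)
        d = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = w * d
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances from responsibilities
        nk = r.sum(axis=0)
        w = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])
w, mu, var = em_gmm(x)
```

Note that the result depends on the random initialization of the means, which is precisely the sensitivity the greedy approaches discussed below try to avoid.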
The normal, or Gaussian, distribution is one of the most widely used throughout statistics, natural science, and social science as a simple model for complex phenomena. The selection of the right number of components is a critical issue. The more components there are within the mixture, the better the data fit will be. However, this leads to data overfitting and to an increased computational burden as well.

The selection of the best number of components in a mixture distribution can be performed mainly in two ways: off-line and on-line. On one hand, the off-line procedures evaluate the best model by executing independent runs of the EM algorithm for many different initializations, and evaluating each estimate with criteria that penalize complex models (e.g. the Akaike Information Criterion (AIC) [2], the Schwarz Bayesian Information Criterion [3], the Rissanen Minimum Description Length (MDL) [4], and the Wallace and Freeman Minimum Message Length (MML) [5]). All of these criteria, in order to be effective, have to be evaluated for every possible number of components under comparison. Therefore, to cover a sufficient search range, the computational cost grows with both the number of tested models and the number of model parameters. On the other hand, the on-line procedures start with a fixed set of models and sequentially adjust their configuration (including the number of components) based on different evaluation criteria.

Pernkopf and Bouchaffra proposed a genetic-based EM algorithm capable of learning Gaussian mixture models [6]. They first selected the number of components by means of the minimum description length (MDL) criterion, and then explored a combination of genetic algorithms with EM. Among the on-line techniques it is possible to distinguish three different categories: those that only increment the number of components, those that both increment and reduce them (split-and-merge techniques), and those that only reduce them.
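The off-line procedure described above can be sketched as follows, using scikit-learn's `GaussianMixture` for brevity and BIC as the penalized criterion; the candidate range and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data drawn from a true two-component mixture
x = np.concatenate([rng.normal(-4, 1, 300), rng.normal(3, 1, 300)]).reshape(-1, 1)

# Off-line selection: fit a separate mixture for each candidate number of
# components and keep the one minimizing BIC = -2 log L + p log n.
best_k, best_bic = None, np.inf
for k in range(1, 6):
    gm = GaussianMixture(n_components=k, n_init=3, random_state=0).fit(x)
    bic = gm.bic(x)  # penalizes models with more free parameters
    if bic < best_bic:
        best_k, best_bic = k, bic

print(best_k)
```

The loop makes the cost structure explicit: every candidate model order requires its own full EM run (here, several, via `n_init`), which is why the complexity grows with the size of the search range.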
A greedy algorithm is characterized by making the locally optimal choice at each stage, with the hope of finding the global optimum. Applied to EM, greedy methods usually start with a single component (thereby sidestepping the EM initialization problem), and then increase the number of components during the computation. However, the key issue in this kind of algorithm is the insertion criterion: deciding when and how to insert a new component can determine the success or failure of the subsequent computation.

In 2002 Vlassis and Likas introduced a greedy algorithm for learning Gaussian mixtures [7]. They start with a single component covering all the data, then split a component and perform EM locally, optimizing only the two modified components. Nevertheless, the total complexity of the global search for the component to be split is O(n²). Subsequently, Verbeek et al. developed a greedy method to learn the Gaussian mixture model configuration [8]. Their search for the new components is claimed to take O(n). Greedy algorithms mostly (but not always) fail to find