Online variational learning of generalized Dirichlet mixture models with feature selection

Wentao Fan a, Nizar Bouguila b,*

a Department of Electrical and Computer Engineering, Concordia University, Montreal, QC, Canada H3G 1T7
b The Concordia Institute for Information Systems Engineering (CIISE), Concordia University, Montreal, QC, Canada H3G 1T7

Article history: Received 20 December 2011; Received in revised form 15 April 2012; Accepted 10 September 2012; Available online 13 June 2013

Keywords: Online learning; Mixture models; Feature selection; Variational inference; Generalized Dirichlet; Documents clustering

Abstract

Three frequently recurring themes in machine learning, data mining and related disciplines are clustering, feature selection and online learning. Motivated by the importance of these generally interrelated themes, we propose a statistical framework for simultaneous online clustering and feature selection using a finite generalized Dirichlet mixture model. The proposed framework allows us to control overfitting by dynamically and simultaneously adjusting the mixture model's parameters, the number of components and the feature weights. We describe a principled variational approach for learning the parameters of the proposed statistical model. Results on both synthetic data and real applications involving online document and image clustering show the merits of the proposed approach.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

Advances in computing power and data storage technology have led to the development of several machine learning and data mining techniques [1–3]. In particular, several supervised and unsupervised learning approaches have been proposed [4]. Finite mixture models are among the most widely used techniques for unsupervised learning [5]. However, the deployment of these models is not a trivial task.
Challenges associated with the adoption of finite mixture models come from many fronts: one must choose an appropriate distribution for the different mixture components; one must consider a suitable approach for the learning task (i.e. both estimation of the parameters and selection of the number of mixture components); one must develop an efficient feature selection approach to avoid overfitting and improve generalization performance; and one must deal with model updating issues by taking into account the dynamic nature of real data sets, where the number of feature vectors grows as the collection of data continues.

Finite Gaussian mixture models have been widely used for modeling multidimensional data sets in many fields. However, the Gaussian mixture is not an appropriate choice when the partitions of the data set are clearly non-Gaussian. This is especially true for proportional vectors (i.e. normalized count vectors), which obey two restrictions, namely nonnegativity and the unit-sum constraint. For this kind of data, mixtures of Dirichlet [6,7] and generalized Dirichlet (GD) [8–11] distributions have been shown to be better alternatives and have proven to be of high value and potential in several real-life applications. An advantage of the GD over the Dirichlet is that it has a more general covariance structure, whereas the covariance of the Dirichlet is strictly negative. This makes the GD more practical and useful in Bayesian learning scenarios in general and in finite mixture modeling in particular, as shown in [8,12]. In particular, the authors in [12] proposed a learning algorithm for simultaneous feature selection and clustering of high-dimensional data modeled using GD mixtures.
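The covariance argument above can be illustrated empirically. The following is a minimal NumPy sketch (not the learning algorithm proposed in this paper) that draws GD vectors through the standard stick-breaking construction from independent Beta variates, and contrasts the sign of the resulting covariances with those of an ordinary Dirichlet; the parameter values are purely illustrative choices for which the GD exhibits a positive pairwise covariance, something the Dirichlet cannot represent.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gd(alpha, beta, n):
    """Sample n vectors from a generalized Dirichlet distribution via its
    stick-breaking construction: u_d ~ Beta(alpha_d, beta_d),
    x_1 = u_1, x_d = u_d * prod_{j<d} (1 - u_j), with the leftover stick
    appended as a last component so each vector is nonnegative and sums to one."""
    D = len(alpha)
    u = rng.beta(alpha, beta, size=(n, D))      # independent Beta draws
    stick = np.cumprod(1.0 - u, axis=1)         # prod_{j<=d} (1 - u_j)
    x = np.empty((n, D + 1))
    x[:, 0] = u[:, 0]
    x[:, 1:D] = u[:, 1:] * stick[:, :-1]
    x[:, D] = stick[:, -1]                      # remaining mass
    return x

# Illustrative parameters: a high-variance first stick (Beta(0.5, 0.5)) and a
# low-variance second stick (Beta(50, 50)) make components 2 and 3 positively
# correlated under the GD.
gd = sample_gd(np.array([0.5, 50.0, 2.0]), np.array([0.5, 50.0, 2.0]), 200_000)
dirichlet = rng.dirichlet([2.0, 3.0, 4.0], size=200_000)

gd_cov = np.cov(gd, rowvar=False)
dir_cov = np.cov(dirichlet, rowvar=False)

print("GD cov(x2, x3):", gd_cov[1, 2])         # positive for these parameters
print("Dirichlet cov(x1, x2):", dir_cov[0, 1])  # always negative
```

The Dirichlet's off-diagonal covariances are always negative (analytically, Cov(x_i, x_j) = -a_i a_j / (a_0^2 (a_0 + 1)) with a_0 the sum of the parameters), while the GD, having one extra shape parameter per stick, can realize both signs.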
The work in [12] can be viewed as an extension of [13], which was based on the classic Gaussian distribution with diagonal covariance, and has been shown to prevent overfitting and to improve the scalability, interpretability and generalization capabilities of the resulting mixture models. Yet, like the majority of existing simultaneous clustering and feature selection techniques 1, this approach works in batch settings. Indeed, to the best of our knowledge to date, online learning of finite mixture models 2 has mostly been studied without taking into account the important problem of feature selection (see, for instance, [17,18]). The problem is challenging since both the model complexity (i.e. number of clusters) and the feature relevancy

* Corresponding author. Tel.: +1 514 848 2424.
E-mail addresses: wenta_fa@encs.concordia.ca (W. Fan), bouguila@ciise.concordia.ca, nizar.bouguila@concordia.ca (N. Bouguila).

1 A lot of work has been devoted to feature selection in the past, and it is very difficult to do justice to the many models and contributions that have been proposed. Some recent important techniques dealing with feature selection in model-based and non-model-based cluster analysis have been reviewed and studied empirically in [14].

2 It is noteworthy that the online learning of several other statistical models has also been the topic of extensive research in the past (see, for instance, [15,16]).

Neurocomputing 126 (2014) 166–179
http://dx.doi.org/10.1016/j.neucom.2012.09.047