Possibilistic Clustering based on Robust Modeling of Finite Generalized Dirichlet Mixture

M. Maher Ben Ismail and Hichem Frigui
Multimedia Research Laboratory, CECS Dept., University of Louisville, USA
mmbeni01@louisville.edu, h.frigui@louisville.edu

Abstract

We propose a novel possibilistic clustering algorithm based on robust modeling of the finite Generalized Dirichlet (GD) mixture. The algorithm generates two types of membership degrees. The first is a posterior probability that indicates the degree to which a point fits the estimated distribution. The second represents the degree of "typicality" and is used to identify and discard noise points. The algorithm minimizes a single objective function to optimize the GD mixture parameters and the possibilistic membership values. This optimization is performed iteratively by dynamically updating the Dirichlet mixture parameters and the membership values at each iteration. We compare the performance of the proposed algorithm with an EM-based approach and show that the possibilistic approach is more robust.

1. Introduction

During the last two decades, finite mixture models [1] have emerged as a flexible and powerful modeling tool for probabilistic model-based clustering. Finite mixtures naturally model data samples that are assumed to have been produced by one of a set of alternative random sources. Inferring the parameters of these sources and identifying which source produced each sample leads to the problem of data clustering. Despite recent progress, this is still an open research problem, and it becomes more acute when the data are corrupted by noise and are high dimensional. Gaussian mixtures, with assumed diagonal covariance matrices for the components, have been used frequently [2]. However, Gaussian functions cannot approximate asymmetric distributions. Recently, the Generalized Dirichlet (GD) mixture has been adopted as a good alternative [3].
In [5], the authors proved that the GD distribution is more appropriate for modeling data that are compactly supported, such as data originating from videos, images, or text. Moreover, GD distributions can be transformed to yield features that are independent and follow Beta distributions. Thus, the conditional independence assumption among features, commonly used in data clustering [6] to model high-dimensional data, holds exactly for GD samples without loss of accuracy.

The problem of estimating the parameters of a GD mixture has been the subject of diverse studies [8], and the maximum likelihood (ML) method [1, 11] is the most common approach. Another approach is to use expectation maximization (EM) [5, 11]. However, these methods do not perform well when the data are noisy. In fact, noise points and outliers can drastically affect the estimate of the model parameters and, hence, the final clustering partition.

To overcome this limitation, we propose a possibilistic approach for GD mixture parameter estimation and data clustering. Our approach generates possibilistic membership functions which represent the "typicality" of each data point. This is in addition to the posterior probabilities, which indicate how well each point fits the estimated distribution.

2. Possibilistic Clustering based on Robust Generalized Dirichlet Mixture Model

Let Y = (Y_1, Y_2, ..., Y_N) be a set of N points, where Y_i ∈ R^d. We assume that Y is generated by a mixture of GD distributions with parameters θ* = (θ*_1, θ*_2, ..., θ*_M, p_1, ..., p_M), where θ*_j is the parameter vector of the j-th GD component and the p_j are the mixing weights. The finite GD mixture models the data using

p(Y | θ*) = Σ_{j=1}^{M} p_j p(Y | θ*_j),    (1)

where p(Y | θ*_j) is the GD distribution.
Each θ*_j = (α*_j1, β*_j1, α*_j2, β*_j2, ..., α*_jd, β*_jd) is the set of parameters of the j-th component.

2010 International Conference on Pattern Recognition, 1051-4651/10 $26.00 © 2010 IEEE, DOI 10.1109/ICPR.2010.145
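As a minimal sketch of the mixture density in Eq. (1), the following Python snippet evaluates a GD mixture using the standard Connor-Mosimann form of the GD density (the form assumed here; the paper's references [3, 5] use this parameterization), where each component has shape pairs (α_ji, β_ji) per dimension. The function names and the list-of-lists parameter layout are illustrative choices, not the paper's notation.

```python
from math import lgamma, log, exp

def gd_log_pdf(y, alpha, beta):
    """Log-density of a Generalized Dirichlet (Connor-Mosimann) distribution.

    y: point with y_i > 0 and sum(y) < 1; alpha, beta: per-dimension shapes.
    """
    d = len(y)
    logp = 0.0
    cum = 0.0  # running sum y_1 + ... + y_i
    for i in range(d):
        # log of the Beta-function normalizer for dimension i
        logp += lgamma(alpha[i] + beta[i]) - lgamma(alpha[i]) - lgamma(beta[i])
        logp += (alpha[i] - 1.0) * log(y[i])
        cum += y[i]
        # exponent gamma_i couples consecutive dimensions
        gamma_i = beta[i] - alpha[i + 1] - beta[i + 1] if i < d - 1 else beta[i] - 1.0
        logp += gamma_i * log(1.0 - cum)
    return logp

def gd_mixture_pdf(y, weights, alphas, betas):
    """Mixture density of Eq. (1): p(y | theta*) = sum_j p_j p(y | theta*_j)."""
    return sum(w * exp(gd_log_pdf(y, a, b))
               for w, a, b in zip(weights, alphas, betas))
```

For d = 1 the GD reduces to a Beta(α, β) distribution, which gives a quick sanity check: a single-component "mixture" with α = β = 2 evaluated at y = 0.5 yields the Beta(2, 2) density 1.5.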