Received: 3 August 2018 Revised: 7 July 2019 Accepted: 26 September 2019
DOI: 10.1111/coin.12246
ORIGINAL ARTICLE
Proportional data modeling via selection and estimation of a finite mixture of scaled Dirichlet distributions
Nuha Zamzami (1,2), Rua Alsuroji (1,3), Oboh Eromonsele (1), Nizar Bouguila (1)
(1) Concordia Institute for Information Systems Engineering (CIISE), Concordia University, Montreal, Quebec, Canada
(2) Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
(3) College of Computers and Information Systems, Umm Al-Qura University, Makkah, Saudi Arabia
Correspondence
Nuha Zamzami, Concordia Institute for Information Systems Engineering, Concordia University, 1515 St. Catherine Street West, H3G 2W1, Montreal, Canada.
Email: n_zamz@encs.concordia.ca
Present Address
Concordia Institute for Information Systems Engineering, Concordia University, 1515 St. Catherine Street West, H3G 2W1, Montreal, Canada. Telephone: +1 (514) 209-5257, Ext. 7176
Abstract
This paper proposes an unsupervised algorithm for learning a finite mixture of scaled Dirichlet distributions. Parameter estimation is based on the maximum likelihood approach, and the minimum message length (MML) criterion is used to select the optimal number of components. This work is motivated by the limited flexibility of the Dirichlet distribution, the widely used model for multivariate proportional data, which has prompted a number of researchers to search for generalizations of the Dirichlet. The extra parameters introduced by the scaled Dirichlet make it possible to obtain several useful statistical models. Experimental results are presented on both synthetic and real datasets. Moreover, challenging real-world applications are investigated empirically to evaluate the efficiency of the proposed statistical framework.
KEYWORDS
mixture modeling, model selection, proportional data clustering, scaled Dirichlet distribution, unsupervised learning
1 INTRODUCTION
Collections of digital data, such as scientific, medical, demographic, financial, and marketing data, are growing explosively every day [1,2]. Therefore, the extraction of useful implicit patterns or knowledge from that huge amount of data is a competitive
Computational Intelligence. 2019;1–27. wileyonlinelibrary.com/journal/coin © 2019 Wiley Periodicals, Inc. 1