Received: 3 August 2018   Revised: 7 July 2019   Accepted: 26 September 2019
DOI: 10.1111/coin.12246

ORIGINAL ARTICLE

Proportional data modeling via selection and estimation of a finite mixture of scaled Dirichlet distributions

Nuha Zamzami 1,2, Rua Alsuroji 1,3, Oboh Eromonsele 1, Nizar Bouguila 1

1 Concordia Institute for Information Systems Engineering (CIISE), Concordia University, Montreal, Quebec, Canada
2 Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
3 College of Computers and Information Systems, Umm Al-Qura University, Makkah, Saudi Arabia

Correspondence
Nuha Zamzami, Concordia Institute for Information Systems Engineering, Concordia University, 1515 St. Catherine Street West, H3G 2W1, Montreal, Canada.
Email: n_zamz@encs.concordia.ca

Present Address
Concordia Institute for Information Systems Engineering, Concordia University, 1515 St. Catherine Street West, H3G 2W1, Montreal, Canada.
Telephone: +1 (514) 209-5257, Ext. 7176

Abstract
This paper proposes an unsupervised algorithm for learning a finite mixture of scaled Dirichlet distributions. Parameter estimation is based on the maximum likelihood approach, and the minimum message length (MML) criterion is proposed for selecting the optimal number of components. This work is motivated by the limited flexibility of the Dirichlet distribution, the widely used model for multivariate proportional data, which has prompted a number of researchers to seek generalizations of it. By introducing the extra parameters of the scaled Dirichlet, several useful statistical models can be obtained. Experimental results are presented using both synthetic and real datasets. Moreover, challenging real-world applications are empirically investigated to evaluate the efficiency of our proposed statistical framework.
Computational Intelligence. 2019;1–27. wileyonlinelibrary.com/journal/coin © 2019 Wiley Periodicals, Inc.

KEYWORDS
mixture modeling, model selection, proportional data clustering, scaled Dirichlet distribution, unsupervised learning

1 INTRODUCTION

Collections of digital data are growing explosively every day, for example, scientific, medical, demographic, financial, and marketing data.1,2 Therefore, the extraction of useful implicit patterns or knowledge from this huge amount of data is a competitive