Received 4 March 2023, accepted 3 April 2023, date of publication 6 April 2023, date of current version 13 April 2023.
Digital Object Identifier 10.1109/ACCESS.2023.3265019

Boosting Algorithm to Handle Unbalanced Classification of PM2.5 Concentration Levels by Observing Meteorological Parameters in Jakarta-Indonesia Using AdaBoost, XGBoost, CatBoost, and LightGBM

TONI TOHARUDIN 1, REZZY EKO CARAKA 1,2,3 (Member, IEEE), INDAH RESKI PRATIWI 1, YUNHO KIM 3, PRANA UGIANA GIO 4, ANJAR DIMARA SAKTI 5, MAENGSEOK NOH 6, FARID AZHAR LUTFI NUGRAHA 1, RESA SEPTIANI PONTOH 1, TAFIA HASNA PUTRI 1, THALITA SAFA AZZAHRA 1, JESSICA JESSLYN CERELIA 1, GUMGUM DARMAWAN 1, AND BENS PARDAMEAN 7,8

1 Department of Statistics, Faculty of Mathematics and Natural Science, Padjadjaran University, West Java 45361, Indonesia
2 Research Center for Data and Information Sciences, Research Organization for Electronics and Informatics, National Research and Innovation Agency, Bandung, West Java 40135, Indonesia
3 Department of Mathematical Sciences, Ulsan National Institute of Science and Technology, Ulsan 44919, Republic of Korea
4 Department of Mathematics, Universitas Sumatera Utara, Medan 20155, Indonesia
5 Remote Sensing and Geographic Information Sciences Research Group, Faculty of Earth Sciences and Technology, Bandung Institute of Technology, Bandung 40132, Indonesia
6 College of Information Technology and Convergence, Pukyong National University, Busan 48513, South Korea
7 Bioinformatics and Data Science Research Center, Bina Nusantara University, Jakarta 11480, Indonesia
8 Department of Computer Science, BINUS Graduate Program–Master of Computer Science Program, Bina Nusantara University, Jakarta 11480, Indonesia

Corresponding authors: Toni Toharudin (toni.toharudin@unpad.ac.id) and Yunho Kim (yunhokim@unist.ac.kr)

The work of Toni Toharudin, Rezzy Eko Caraka, Resa Septiani Pontoh, and Gumgum Darmawan was supported by the Directorate of Research and Community Engagement,
Universitas Padjadjaran. The work of Rezzy Eko Caraka was supported in part by the National Research Foundation of Korea under Grant NRF-2023R1A2C1006845. The work of Yunho Kim was supported by the National Research Foundation of Korea under Grant NRF-2022R1A5A1033624 and Grant NRF-2023R1A2C1006845. The work of Rezzy Eko Caraka and Maengseok Noh was supported by the “Leaders in Industry-University Cooperation 3.0” Research Project by the Ministry of Education and the National Research Foundation of Korea.

ABSTRACT Air quality has deteriorated in the Jakarta area, which ranks among the world’s eight worst cities according to the 2022 Air Quality Index (AQI) report. In particular, the latest data on air quality in Jakarta and surrounding areas from the Meteorological, Climatological, and Geophysical Agency (BMKG) of the Republic of Indonesia show that PM2.5 concentrations have increased, peaking at 148 μg/m³ in 2022. While a classification system for this pollution is necessary and critical, the PM2.5 concentrations measured at the BMKG Kemayoran station, Jakarta, turn out to form unbalanced data classes. Thus, in this work, we apply supervised boosting algorithms to handle this unbalanced classification of PM2.5 concentration levels by observing meteorological patterns in Jakarta from 1 January 2015 to 7 July 2022. The boosting algorithms considered in this research are Adaptive Boosting (AdaBoost), Extreme Gradient Boosting (XGBoost), Categorical Boosting (CatBoost), and Light Gradient Boosting Machine (LightGBM). Our simulations show that boosting classification can significantly reduce bias, in combination with variance reduction, under unbalanced within-class coefficients, with classification values for the PM2.5 classes of 62% (good), 34% (moderate), and 59% (unhealthy).

INDEX TERMS Boosting, unbalanced classification, PM2.5, XGBoost, AdaBoost, LightGBM, CatBoost.
The associate editor coordinating the review of this manuscript and approving it for publication was Zhaojun Steven Li.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/
VOLUME 11, 2023

I. INTRODUCTION
Data science is an applied science that explicitly studies and analyzes data. In today’s digital and big data era, data