Abstract—Air pollution poses a considerable danger to health and the environment. The objective of this study was to assess the characteristics of air quality and to predict PM10 concentrations using boosted regression trees (BRTs). Maximum daily PM10 concentration data from 2002 to 2016 were obtained from the air quality monitoring station in Kuching, Sarawak. Eighty percent of the monitoring records were used for training and twenty percent for validation of the models. The best iteration of the BRT model was identified by optimizing the prediction performance, while the BRT model itself was constructed from multiple regression models. The two main parameters, the learning rate (lr) and tree complexity (tc), were fixed at 0.01 and 5, respectively. Meanwhile, the number of trees (nt) was determined using an independent test set (test), 5-fold cross validation (CV), and out-of-bag (OOB) estimation. The BRT model produced using CV was a better guide than the one produced using OOB for testing the predicted PM10 concentration. The performance indicators showed that the model was adequate for next-day prediction (PA = 0.638, R² = 0.427, IA = 0.749, NAE = 0.267, and RMSE = 28.455).

Index Terms—Accuracy measures, air pollution, boosted regression trees, PM10, regression.

I. INTRODUCTION

In Malaysia, air quality is monitored continuously throughout the country by the Department of Environment (DOE) at 65 stations. Afroz et al. [1] discussed air pollution caused by open burning and forest fires in Malaysia, which has become harmful to public health and the environment. According to [2], PM10 and O3 are the major causes of the unhealthy days recorded in Malaysia. PM10 is particulate matter with an aerodynamic diameter of less than 10 μm [3]. It is one of the main causes of pneumoconiosis when it enters parts of the respiratory system such as the bronchi and alveoli.
The smaller the dust particles, the deeper into the respiratory tract they penetrate [4].

Manuscript received November 12, 2020; revised January 22, 2021. This work was carried out as part of the statutory activity of the Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Malaysia. This research was funded by the Malaysian Government under the Fundamental Research Grant, grant number 600-IRMI/FRGS 5/3 (289/2019).
The authors are with the Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, 13500 Permatang Pauh, Malaysia (e-mail: shaziayani@uitm.edu.my, ahmadzia101@uitm.edu.my, syarifah.adilah@uitm.edu.my, zuraira946@uitm.edu.my, hasfazilah@uitm.edu.my).

Many previous studies have predicted future PM10 concentrations using a variety of methods. Multiple linear regression (MLR) is the most common. Juneng et al. [5] used MLR to analyse the predictive relationship between the dependent variable (PM10) and the independent variables. They showed that local meteorological factors, particularly local surface air temperature, humidity, and wind speed, dominate the fluctuations of PM10 over the Klang Valley during the summer monsoon. Moreover, Ul-Saufie et al. [6] used a quantile regression model to predict future (next day, next 2 days, and next 3 days) PM10 concentration levels in Seberang Perai, Malaysia, and compared the results with those of MLR. Despite the success of MLR, according to [7], it has difficulty identifying the most important contributors when there is high correlation, or multicollinearity, between the independent variables in the regression equation. One of the favoured techniques for predicting a complex system is the artificial neural network (ANN), such as the ANN model used by [8] to predict PM10 concentrations from the hourly data of a subway platform.
According to [9], the predictive aspect of validation in the ANN model is not sufficient to fully assess whether the developed model captures the underlying dynamics between the independent and dependent variables. BRTs are very reliable and flexible for dealing with complex responses, including interactions and nonlinearities [10]. The BRT algorithm combines many regression trees into a single model. Each regression tree is grown by repeated binary splits and stops growing when certain criteria are met. In recent years, BRTs have been successfully implemented in air quality forecasting applications [11]-[14]. Table I lists recent studies that have been conducted on air pollution in Malaysia. It shows that few studies have used a BRT to predict PM10 concentrations in Malaysia. A BRT works very well with large datasets and is robust with regard to missing values and outliers. Therefore, this study was conducted to predict PM10 concentrations using the BRT approach developed by [15]. In contrast to other researchers, who used hourly and averaged daily data, this study used maximum daily data. Furthermore, this study used a BRT to predict next-day concentrations, which differs from the BRT prediction produced by [16].

Evaluation of Boosted Regression Tree for the Prediction of the Maximum 24-Hour Concentration of Particulate Matter
Wan Nur Shaziayani, Ahmad Zia Ul-Saufie, Syarifah Adilah Mohamed Yusoff, Hasfazilah Ahmat, and Zuraira Libasin
International Journal of Environmental Science and Development, Vol. 12, No. 4, April 2021, p. 126
doi: 10.18178/ijesd.2021.12.4.1329
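The idea described above, a BRT as a combination of regression trees grown by binary splits and added with a learning rate, with the number of trees chosen on held-out data, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes squared-error loss, synthetic data in place of the Kuching PM10 records, single-split trees (tc = 1 rather than the tc = 5 used in the study) to keep the code short, and an RMSE computed with a 1/n divisor. All function and variable names are hypothetical.

```python
import math
import random


def fit_stump(x, residuals):
    """Fit a one-split regression tree: the binary split minimising squared error."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    total = sum(residuals)
    left_sum = 0.0
    best = None
    for k in range(1, n):
        i = order[k - 1]
        left_sum += residuals[i]
        if x[order[k]] == x[i]:
            continue  # cannot split between identical predictor values
        right_sum = total - left_sum
        # Minimising the squared error is equivalent to maximising this gain.
        gain = left_sum ** 2 / k + right_sum ** 2 / (n - k)
        if best is None or gain > best[0]:
            threshold = (x[i] + x[order[k]]) / 2
            best = (gain, threshold, left_sum / k, right_sum / (n - k))
    if best is None:  # all predictor values identical: predict the mean
        mean = total / n
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def boost(x, y, n_trees, lr):
    """Add trees one at a time, each fitted to the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - p for yi, p in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [p + lr * predict_stump(stump, xi) for p, xi in zip(pred, x)]
    return base, stumps


def staged_rmse(base, stumps, lr, x, y):
    """RMSE on (x, y) after 1, 2, ..., len(stumps) trees."""
    pred = [base] * len(y)
    curve = []
    for stump in stumps:
        pred = [p + lr * predict_stump(stump, xi) for p, xi in zip(pred, x)]
        mse = sum((yi - p) ** 2 for yi, p in zip(y, pred)) / len(y)
        curve.append(math.sqrt(mse))
    return curve


# Synthetic stand-in data (an assumption, not the monitoring records).
random.seed(0)
x_all = [random.random() for _ in range(200)]
y_all = [50 + 30 * math.sin(2 * math.pi * xi) + random.gauss(0, 5) for xi in x_all]

# 80% for training and 20% held out, mirroring the split in the abstract.
x_train, y_train = x_all[:160], y_all[:160]
x_test, y_test = x_all[160:], y_all[160:]

LR = 0.01        # learning rate (lr) as fixed in the study
N_TREES = 1000   # upper bound on the number of trees (nt) to try

base, stumps = boost(x_train, y_train, N_TREES, LR)
curve = staged_rmse(base, stumps, LR, x_test, y_test)

# Choose nt as the iteration with the lowest held-out RMSE.
best_nt = min(range(N_TREES), key=curve.__getitem__) + 1
print("best nt:", best_nt, "held-out RMSE:", round(curve[best_nt - 1], 3))
```

Choosing nt on an independent test set is only one of the three options named in the abstract. The CV and OOB variants would replace the held-out curve with an error estimate from 5-fold cross validation or from the samples left out of each tree's subsample.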