Received October 13, 2020, accepted November 8, 2020, date of publication November 17, 2020, date of current version December 15, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.3038658

Empirical Comparison of Approaches for Mitigating Effects of Class Imbalances in Water Quality Anomaly Detection

EUSTACE M. DOGO 1, NNAMDI I. NWULU 1, BHEKISIPHO TWALA 2 (Senior Member, IEEE), AND CLINTON OHIS AIGBAVBOA 3
1 Department of Electrical and Electronic Engineering Science, University of Johannesburg, Johannesburg 2006, South Africa
2 Faculty of Engineering and the Built Environment, Durban University of Technology, Durban 4000, South Africa
3 Sustainable Human Settlement and Construction Research Centre, Faculty of Engineering and the Built Environment, University of Johannesburg, Johannesburg 2006, South Africa
Corresponding author: Eustace M. Dogo (eustaced@uj.ac.za)
This work was supported by the University of Johannesburg, South Africa.

ABSTRACT Imbalanced class distribution and missing data are two common problems in the water quality anomaly detection domain. Learning algorithms trained on an imbalanced dataset can yield an inflated classification accuracy driven by a bias towards the majority class at the expense of the minority class. Missing values, on the other hand, can add complexity to the learning classifiers during data analysis. These two problems pose substantial challenges to the performance of learning algorithms in real-life water quality anomaly detection problems, and hence need to be carefully considered and addressed to achieve better performance. In this paper, the performance of a range of combinations of techniques for dealing with imbalanced classes, in the context of a binary-imbalanced water quality anomaly detection problem and in the presence of missing values, is extensively compared.
The methods considered include seven missing data methods and eight resampling methods, applied to ten different state-of-the-art learning classifiers, taking into account diversity in their learning philosophies. The different classifiers are evaluated using stratified 5-fold cross-validation, based on three performance evaluation metrics, namely accuracy, ROC-AUC and F1-measure. Further experiments are carried out on nineteen variants of homogeneous and heterogeneous ensemble techniques embedded with resampling and missing value strategies during their training phase, as well as on an optimized deep neural network model. The experimental results show an improvement in the performance of the learning classifiers, especially when dealing with the class imbalance problem (on the one hand) and the incomplete data problem (on the other hand). Furthermore, the neural network model exhibits superior performance when dealing with both problems.

INDEX TERMS Class-imbalance, data preprocessing, imputation, machine learning, resampling, water quality.

The associate editor coordinating the review of this manuscript and approving it for publication was Xinyu Du.

I. INTRODUCTION
There is a consensus that easy public access to water of good quality leads to improved health and living conditions, and has a direct impact on the economy and national security of countries. Furthermore, due to the massive amount of data currently generated by water utilities and the impact of the water industry on the lives of people [1], there is a need to implement better ways of water quality monitoring and prediction based on new and advanced technologies such as new and enhanced machine learning and data mining techniques [2]. Imbalanced class distribution (ICD) and missing values (MV) in data are two common problems in data analysis that are synonymous with data quality issues [3]–[5].
VOLUME 8, 2020, page 218015. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

MV and ICD continue to be prevalent in numerous real-world problems and across many application areas [6], [7], including the water quality anomaly detection domain. Consequently, these occurrences have continued to attract considerable attention from researchers, since the majority of conventional predictive machine learning algorithms are not designed to handle these challenges in data: they assume completeness of data and a balanced