Abstract: The problem of handling class imbalance by modifying the decision tree algorithm has received widespread attention in recent years. This paper introduces a new splitting measure, called class overlapping-balancing entropy (OBE), that pays equal attention to all classes. At each step, the proportion of each class is balanced via assigned weight values, which depend not only on equalizing the classes but also on the overlapping region between classes. The proportion of weights corresponding to each class is used as the component of Shannon's entropy for splitting the current dataset. Experimental results show that OBE significantly outperforms conventional splitting measures such as the Gini index, gain ratio and DCSM, which are used in well-known decision tree algorithms. It also exhibits superior performance compared to AE and ME, which are designed specifically for handling the class imbalanced problem.

Index Terms: Classification problem, class imbalanced learning, class overlapping-balancing entropy, decision tree algorithm.

I. INTRODUCTION

A decision tree is recognized as one of the top 10 classification models [1]. Its success can be explained by three characteristics. First, a decision tree algorithm consumes little computational time for constructing the model, and especially little during the prediction step. Second, a decision tree is easy for humans to interpret, so it has been used for ranking variable importance. Third, a decision tree is robust with respect to anomalies and missing values. However, like most well-known classifiers, a decision tree algorithm must face the difficulty of classifying a dataset with an extremely unequal class distribution [2]. This problem, known as the class imbalanced problem, has played an important role in knowledge discovery and data mining for the past several years.
It widely appears in several real-world situations such as fraud detection [3], [4], disease diagnosis [5], [6], network intrusion detection [7], industrial systems monitoring [8] and sentiment analysis [9]. To maximize classification accuracy, the decision tree algorithm often builds a tree that predicts most unknown instances to be the class containing a large number of instances, called the majority class. Hence, instances from the class containing a tiny number of instances, called the minority class, tend to be incorrectly classified. In real-world problems, the smaller class is frequently more important and receives much attention to be correctly classified. For example, in fraud detection, there is only a small number of fraudulent transactions, but they are significant and must be discovered. Likewise, in disease diagnosis, the prediction of disease patients is more critical than that of normal people. Many methods have been presented to deal with the class imbalanced problem using various techniques [10], [11]. Developing decision tree algorithms that are suitable for classifying an imbalanced dataset is one approach that has received wide attention. Normally, the improvement of a decision tree algorithm focuses on modifying the splitting measure used to separate the dataset at each node. Traditional splitting measures, especially the Gini index [12] and Shannon's entropy [13], have been improved using many concepts in recent years. Asymmetric entropy (AE) [14], off-centered entropy (OCE) [15], [16] and AECID [17] apply the concept of non-symmetry instead of the symmetric one.

(Manuscript received February 10, 2020; revised March 5, 2020. The authors are with the Department of Mathematics and Computer Science, Faculty of Science, Chulalongkorn University, Bangkok 10330, Thailand; e-mail: a.sagoolmuang@gmail.com, krung.s@chula.ac.th.)
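As background for the modifications discussed here, the two traditional splitting measures can be sketched directly from their definitions. The function names below are illustrative, not from the paper; both functions take per-class instance counts at a node.

```python
from math import log2

def gini(counts):
    """Gini index of a node, given per-class instance counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Shannon's entropy of a node, given per-class instance counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

# A 90/10 imbalanced node: both measures are far below their maxima
# (0.5 for Gini, 1.0 for entropy with two classes), so the node already
# looks nearly "pure" and split decisions favor the majority class.
print(round(gini([90, 10]), 4))     # 0.18
print(round(entropy([90, 10]), 4))  # 0.469
```

Both measures peak when the classes are equally represented, which is why an imbalanced node contributes little impurity and the minority class has little influence on the split.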
They shift the maximum value of entropy away from the midpoint of extreme proportions, as in the symmetric entropy, so that it is biased toward the minority class. In addition, skew-insensitive splitting measures have been suggested for dealing with the class imbalanced problem, such as DKM [18], [19] and HDDT [20], [21]. They can tolerate a considerable difference between the number of instances in the minority class and the majority class. Lastly and most importantly, the concept of modifying the components of the Gini index and Shannon's entropy calculation to be inclined toward the minority class was introduced in CART+Resampling [22] and minority entropy (ME) [23], respectively. These methods discard majority instances that do not affect the split decision for minority instances. CART+Resampling applies the sampling method directly, while ME ignores majority instances outside the minority range, which has a similar effect to sampling. This paper suggests a modification of the components of Shannon's entropy, like ME, for continuous attributes. The proposed splitting measure, designed to handle the class imbalanced problem, is called class overlapping-balancing entropy (OBE). It assigns a larger weight to an instance that lies outside the overlapping region between two classes than to other instances. Moreover, the sums of weights among all classes are set equal to one another to balance the classes. Then, the proportion of weights corresponding to each class is used as the component of Shannon's entropy for splitting the current dataset.

Decision Tree Algorithm with Class Overlapping-Balancing Entropy for Class Imbalanced Problem
Artit Sagoolmuang and Krung Sinapiromsaran
International Journal of Machine Learning and Computing, Vol. 10, No. 3, May 2020
doi: 10.18178/ijmlc.2020.10.3.955
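The weighting idea behind OBE can be sketched as follows. This is an illustrative approximation only: the exact OBE formulas are defined later in the paper, so the overlap test (intersection of per-class value ranges on a 1-D continuous attribute), the raw weight values (2.0 outside the overlap, 1.0 inside), and all function names are assumptions made for this sketch.

```python
from math import log2

def balanced_weights(values, labels, outside_w=2.0, inside_w=1.0):
    """Assign larger raw weights to instances outside the class-overlap
    region of a continuous attribute, then scale each class's weights
    so that every class contributes the same total weight (here, 1)."""
    classes = sorted(set(labels))
    # Overlap region: intersection of the per-class [min, max] ranges.
    lo = max(min(v for v, c in zip(values, labels) if c == k) for k in classes)
    hi = min(max(v for v, c in zip(values, labels) if c == k) for k in classes)
    raw = [inside_w if lo <= v <= hi else outside_w for v in values]
    total = {k: sum(w for w, c in zip(raw, labels) if c == k) for k in classes}
    return [w / total[c] for w, c in zip(raw, labels)]

def weighted_entropy(weights, labels, subset):
    """Shannon's entropy of the per-class weight proportions on a subset
    of instance indices (e.g. one side of a candidate split)."""
    classes = sorted(set(labels))
    mass = {k: sum(weights[i] for i in subset if labels[i] == k) for k in classes}
    grand = sum(mass.values())
    return -sum((m / grand) * log2(m / grand) for m in mass.values() if m > 0)

# Imbalanced toy data: majority class 0 overlaps minority class 1 near x = 5.5..6.
x = [1, 2, 3, 4, 5, 6, 5.5, 6.5, 7]
y = [0, 0, 0, 0, 0, 0, 1,   1,   1]
w = balanced_weights(x, y)
left  = [i for i, v in enumerate(x) if v <= 4.5]  # candidate split: x <= 4.5
right = [i for i, v in enumerate(x) if v > 4.5]
```

Because each class's weights sum to the same total, the minority class carries as much mass as the majority class when evaluating candidate splits, while the extra weight outside the overlap region emphasizes instances whose class membership is unambiguous.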