1624 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 21, NO. 10, OCTOBER 2010

RAMOBoost: Ranked Minority Oversampling in Boosting

Sheng Chen, Student Member, IEEE, Haibo He, Member, IEEE, and Edwardo A. Garcia

Abstract— In recent years, learning from imbalanced data has attracted growing attention from both academia and industry due to the explosive growth of applications that use and produce imbalanced data. However, because of the complex characteristics of imbalanced data, many real-world solutions struggle to provide robust efficiency in learning-based applications. In an effort to address this problem, this paper presents Ranked Minority Oversampling in Boosting (RAMOBoost), a RAMO technique based on the idea of adaptive synthetic data generation in an ensemble learning system. Briefly, RAMOBoost adaptively ranks minority class instances at each learning iteration according to a sampling probability distribution derived from the underlying data distribution, and can adaptively shift the decision boundary toward difficult-to-learn minority and majority class instances by using a hypothesis assessment procedure. Simulation analysis on 19 real-world datasets assessed over various metrics—including overall accuracy, precision, recall, F-measure, G-mean, and receiver operating characteristic analysis—is used to illustrate the effectiveness of this method.

Index Terms— Adaptive boosting, data mining, ensemble learning, imbalanced data.

I. INTRODUCTION

LEARNING from imbalanced data (imbalanced learning) [1], [2] has become a critical and significant research issue in many of today's data-intensive applications, such as financial engineering, anomaly detection, biomedical data analysis, and many others. The amount and complexity of raw data captured to monitor, analyze, and support decision-making processes continue to grow at an incredible rate.
Consequently, this enhances the capacity for computationally intelligent methods to play an essential role in applications involving large amounts of data. On the other hand, these opportunities also raise many new challenges for the research community in general [3]–[5].

Generally speaking, any dataset that exhibits an unequal distribution between its classes can be considered imbalanced. In real-world applications, datasets exhibiting severe imbalances are of great interest since they generally present significant difficulties for learning mechanisms. Typical imbalance ratios can range from 1:100 in fraud detection problems [6] to 1:100 000 in high-energy physics event classification [7]. However, imbalances of this form are just one aspect of the imbalanced learning problem. The imbalanced learning problem generally manifests itself in two forms: relative imbalances and absolute imbalances [1], [8]. Absolute imbalances arise in datasets where minority examples are definitively scarce and underrepresented, whereas relative imbalances are indicative of datasets in which minority examples are well represented but remain severely outnumbered by majority class examples.

Manuscript received February 24, 2009; revised November 2, 2009, March 16, 2010, August 1, 2010, and August 5, 2010; accepted August 5, 2010. Date of publication August 30, 2010; date of current version October 6, 2010. S. Chen and E. A. Garcia are with the Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ 07030 USA (e-mail: schen5@stevens.edu; egarcia@stevens.edu). H. He is with the Department of Electrical, Computer and Biomedical Engineering, University of Rhode Island, Kingston, RI 02881 USA (e-mail: he@ele.uri.edu). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNN.2010.2066988
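The relative/absolute distinction above is easiest to see with concrete class counts. The following minimal sketch (not from the paper; the function name and example counts are illustrative assumptions) contrasts a relatively imbalanced dataset, whose minority class is well represented yet outnumbered 1:100, with an absolutely imbalanced one, whose minority class is simply scarce:

```python
from collections import Counter

def class_counts(labels):
    """Return (minority_count, majority_count) for a binary label list."""
    counts = Counter(labels)
    return min(counts.values()), max(counts.values())

# Relative imbalance: 1000 minority examples is plenty to characterize
# the class, but the 1:100 ratio still biases a standard learner.
relative = [0] * 100000 + [1] * 1000

# Absolute imbalance: only 10 minority examples, too few to cover the
# minority concept regardless of the ratio.
absolute = [0] * 1000 + [1] * 10

print(class_counts(relative))  # (1000, 100000)
print(class_counts(absolute))  # (10, 1000)
```

Both datasets are "imbalanced," but, as the studies cited below argue, the second kind (too few representative minority examples) tends to be the more damaging one.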
Some studies have shown that the degradation of classification performance attributed to imbalanced data is not necessarily the result of relative imbalances but rather due to the lack of representative examples (absolute imbalances) [9], [10], [1], [11], [12]. In particular, for a given dataset that contains several sub-concepts, the distribution of minority examples over the minority class concepts may yield clusters with insufficient representative examples to form a classification rule [1]. This problem of concept data representation within a class is also known as the within-class imbalance problem [10], [13], [14], and it was verified to be more difficult to handle than datasets with only homogeneous concepts for each class [10], [12]. Logically, it follows that solutions targeting both relative and absolute imbalances should be better suited to handling a wide spectrum of imbalanced learning problems.

To this end, this paper proposes RAMOBoost, a RAMO technique embedded within a boosting procedure to facilitate learning from imbalanced datasets. Based on an integration of oversampling and ensemble learning, RAMOBoost systematically generates synthetic instances by considering the class ratios of the surrounding nearest neighbors of each minority class example in the underlying training data distribution. Unlike many existing approaches that use uniform sampling distributions, RAMOBoost adaptively adjusts the sampling weights of minority class examples according to their data distributions. Moreover, by integrating the ensemble learning methodology, RAMOBoost adopts an iterative learning procedure that assesses the hypothesis developed at each boosting iteration to adaptively shift the decision boundary to focus more on the difficult-to-learn instances of both the majority and the minority classes.

We organize the remainder of this paper as follows.
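The ranked-oversampling idea described above can be sketched in a few lines: minority examples whose neighborhoods contain more majority-class points receive higher sampling probability, and synthetic instances are then interpolated between a sampled minority point and one of its minority neighbors. This is a minimal illustration under stated assumptions, not the paper's full RAMOBoost algorithm (it omits the boosting loop and hypothesis assessment); the function name, the choice of k, and the brute-force neighbor search are all assumptions made for clarity.

```python
import numpy as np

def ranked_minority_oversample(X_min, X_maj, n_syn, k=5, seed=None):
    """Sketch of one ranked-oversampling step.

    Minority points with more majority-class points among their k nearest
    neighbors get a higher sampling probability; synthetic points are then
    interpolated between a sampled minority point and a minority neighbor.
    """
    rng = np.random.default_rng(seed)
    X_all = np.vstack([X_min, X_maj])
    n_min = len(X_min)

    # Count majority points among each minority point's k nearest neighbors
    # in the combined data (brute-force distances, for clarity only).
    d_all = np.linalg.norm(X_min[:, None, :] - X_all[None, :, :], axis=2)
    weights = np.empty(n_min)
    for i in range(n_min):
        nn = np.argsort(d_all[i])[1:k + 1]    # skip the point itself
        weights[i] = np.sum(nn >= n_min) + 1  # +1 keeps every point eligible
    probs = weights / weights.sum()           # ranked sampling distribution

    # Interpolate between sampled minority points and their minority neighbors.
    d_min = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    synthetic = np.empty((n_syn, X_min.shape[1]))
    for j in range(n_syn):
        i = rng.choice(n_min, p=probs)
        nb = X_min[rng.choice(np.argsort(d_min[i])[1:k + 1])]
        synthetic[j] = X_min[i] + rng.random() * (nb - X_min[i])
    return synthetic

# Toy usage: a 20-point minority cluster against a 200-point majority cluster.
rng = np.random.default_rng(0)
X_min = rng.normal(0.0, 1.0, size=(20, 2))
X_maj = rng.normal(2.0, 1.0, size=(200, 2))
syn = ranked_minority_oversample(X_min, X_maj, n_syn=50, k=5, seed=1)
print(syn.shape)  # (50, 2)
```

Because each synthetic point is a convex combination of two minority points, the generated data stays inside the minority region, while the ranking concentrates generation near the class boundary where majority neighbors dominate.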
In Section II, we present a brief review of the state-of-the-art techniques proposed in the community to address the imbalanced learning problem. In Section III, we discuss the motivation behind the RAMOBoost framework and present

1045–9227/$26.00 © 2010 IEEE