1624 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 21, NO. 10, OCTOBER 2010
RAMOBoost: Ranked Minority Oversampling
in Boosting
Sheng Chen, Student Member, IEEE, Haibo He, Member, IEEE, and Edwardo A. Garcia
Abstract— In recent years, learning from imbalanced data has
attracted growing attention from both academia and industry
due to the explosive growth of applications that use and produce
imbalanced data. However, because of the complex characteristics
of imbalanced data, many real-world solutions struggle to provide
robust performance in learning-based applications. In an effort to
address this problem, this paper presents Ranked Minority Over-
sampling in Boosting (RAMOBoost), which is a RAMO technique
based on the idea of adaptive synthetic data generation in an
ensemble learning system. Briefly, RAMOBoost adaptively ranks
minority class instances at each learning iteration according to a
sampling probability distribution that is based on the underlying
data distribution, and can adaptively shift the decision boundary
toward difficult-to-learn minority and majority class instances by
using a hypothesis assessment procedure. Simulation analysis on
19 real-world datasets assessed over various metrics—including
overall accuracy, precision, recall, F-measure, G-mean, and
receiver operating characteristic analysis—is used to illustrate
the effectiveness of this method.
Index Terms— Adaptive boosting, data mining, ensemble learn-
ing, imbalanced data.
I. INTRODUCTION
LEARNING from imbalanced data (imbalanced learning)
[1], [2] has become a critical and significant research
issue in many of today’s data-intensive applications, such
as financial engineering, anomaly detection, biomedical data
analysis, and many others. The amount and complexity of raw
data that is captured to monitor, analyze, and support decision-
making processes continue to grow at an incredible rate.
Consequently, computationally intelligent methods are increasingly
positioned to play an essential role in applications
involving large amounts of data. On the other hand, these
opportunities also raise many new challenges for the research
community in general [3]–[5].
Generally speaking, any dataset that exhibits an unequal
distribution between its classes can be considered imbalanced.
In real-world applications, datasets exhibiting severe imbal-
ances are of great interest since they generally present signif-
icant difficulties for learning mechanisms. Typical imbalance
ratios can range from 1:100 in fraud detection problems [6]
to 1:100 000 in high-energy physics event classification [7].

Manuscript received February 24, 2009; revised November 2, 2009, March
16, 2010, August 1, 2010, and August 5, 2010; accepted August 5, 2010. Date
of publication August 30, 2010; date of current version October 6, 2010.
S. Chen and E. A. Garcia are with the Department of Electrical and
Computer Engineering, Stevens Institute of Technology, Hoboken, NJ 07030
USA (e-mail: schen5@stevens.edu; egarcia@stevens.edu).
H. He is with the Department of Electrical, Computer and Biomedical
Engineering, University of Rhode Island, Kingston, RI 02881 USA (e-mail:
he@ele.uri.edu).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNN.2010.2066988
However, imbalances of this form are just one aspect of the
problem: imbalanced learning generally manifests itself in two
forms, relative imbalance and absolute imbalance [1], [8].
Absolute imbalances arise in
datasets where minority examples are definitively scarce and
underrepresented, whereas relative imbalances are indicative
of datasets in which minority examples are well represented
but remain severely outnumbered by majority class examples.
Some studies have shown that the degradation of classification
performance attributed to imbalanced data is not necessarily
the result of relative imbalances but rather due to the lack
of representative examples (absolute imbalances) [9], [10],
[1], [11], [12]. In particular, for a given dataset that contains
several sub-concepts, the distribution of minority examples
over the minority class concepts may yield clusters with
insufficient representative examples to form a classification
rule [1]. This problem of concept representation within
a class is also known as the within-class imbalance problem
[10], [13], [14]; datasets exhibiting it have been shown to be
more difficult to handle than datasets with only homogeneous
concepts for each class [10], [12].
Logically, it would follow that solutions targeting both
relative and absolute imbalances would be more adept at
handling a wide spectrum of imbalanced learning problems. To
this end, this paper proposes RAMOBoost, a ranked minority
oversampling (RAMO) technique embedded in a boosting procedure to facilitate
learning from imbalanced datasets. Based on an integration
of oversampling and ensemble learning, RAMOBoost sys-
tematically generates synthetic instances by considering the
class ratios of surrounding nearest neighbors of each minority
class example in the underlying training data distribution.
Unlike many existing approaches that use uniform sampling
distributions, RAMOBoost adaptively adjusts the sampling
weights of minority class examples according to their data
distributions. Moreover, by integrating the ensemble learning
methodology, RAMOBoost adopts an iterative learning proce-
dure that assesses the hypothesis developed at each boosting
iteration to adaptively shift the decision boundary to focus
more on those difficult-to-learn instances of both the majority
and the minority classes.
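As a rough illustration of the adaptive sampling idea described above, the following sketch weights each minority example by the fraction of majority-class points among its k nearest neighbors, so that minority examples near the class boundary receive higher sampling probability. The function name, the simple neighbor-ratio weighting, and the parameter defaults are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def minority_sampling_weights(X, y, minority_label=1, k=5):
    """Assign each minority instance a sampling weight proportional to
    the share of majority-class points among its k nearest neighbors
    (a simplified stand-in for the ranked-sampling step)."""
    minority_idx = np.where(y == minority_label)[0]
    weights = np.empty(len(minority_idx))
    for i, idx in enumerate(minority_idx):
        # Euclidean distances from this minority point to all training points
        d = np.linalg.norm(X - X[idx], axis=1)
        d[idx] = np.inf  # exclude the point itself from its neighborhood
        neighbors = np.argsort(d)[:k]
        # more majority-class neighbors -> harder example -> larger weight
        weights[i] = np.sum(y[neighbors] != minority_label) / k
    total = weights.sum()
    if total == 0:
        # all minority points lie in purely minority neighborhoods:
        # fall back to a uniform sampling distribution
        return np.full(len(minority_idx), 1.0 / len(minority_idx))
    return weights / total  # normalize to a probability distribution
```

The normalized weights can then drive which minority examples are selected as seeds for synthetic data generation at each boosting iteration.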
We organize the remainder of this paper as follows. In
Section II, we present a brief review of the state-of-the-
art techniques proposed in the community to address the
imbalanced learning problem. In Section III, we discuss the
motivation behind the RAMOBoost framework and present
1045–9227/$26.00 © 2010 IEEE