ISSN 2394-3777 (Print)
ISSN 2394-3785 (Online)
Available online at www.ijartet.com
International Journal of Advanced Research Trends in Engineering and Technology (IJARTET)
Vol. 3, Issue 9, September 2016
All Rights Reserved © 2016 IJARTET
An Efficient Mechanism for Classification of
Imbalanced Data
Krithika M V¹, Rajeev Bilagi², Dr. Prashanth C M³
¹ Post Graduate Student, Dept of CS&E, SCE Bangalore, India.
² Mr. Rajeev Bilagi, Assoc. Prof, Dept of CS&E, SCE Bangalore, India.
³ Dr. Prashanth C M, Prof & HOD, Dept of CS&E, SCE Bangalore, India.
Abstract: In many real-world applications, the volume of data generated and stored is growing rapidly. Classification
algorithms struggle to categorize highly imbalanced datasets, and the classification methods studied so far have
centred only on the binary class imbalance problem. Standard classification algorithms tend to be biased towards the
majority class and ignore many of the significant samples present in the minority class. To resolve this issue, a method
called the Hybrid Sampling technique is proposed to deal with multi-class imbalanced data. The proposed method is
efficient because it balances the data distribution of all the classes and incorporates an efficient sample selection
strategy to undersample the majority class. Experiments are performed using various classifiers, and the results of the
proposed system show that the classification prediction rate improves when balanced data with different categories of
class groupings is considered.
Keywords: Classification, data mining, Imbalance Problems, K means Clustering, Multi Class Imbalanced data, Sampling
Techniques, Stratified Sampling
I. INTRODUCTION
Many real-world applications have to identify
the occurrence of rare events in very large datasets. Data
mining techniques analyse massive amounts of data from
various sources and perspectives and summarize them
into useful information [1]. Decision
making needs well-defined methods for exploring data and
extracting knowledge from various areas. Data mining is the
extraction of useful information from massive datasets.
Classification, or categorization, plays a pivotal role in the
application space of data mining. Classification involves
assigning a class label to each of a set of unlabelled examples.
Classification becomes a serious issue with a highly
skewed dataset. The classification algorithms proposed so
far have dealt with the imbalanced binary class problem. This
study evaluates and addresses the problem of imbalanced
data in the multi-class domain. The class
imbalance problem [2] is a predominant issue in the field of
data mining and machine learning. Standard
classification algorithms tend to be biased towards the majority
classes and ignore most of the minority classes, which occur very
rarely but are often the most important.
1.1 Class Imbalanced Data Problem
The class imbalance problem is said to occur when the
number of samples in one class (the superior class) is
much larger than the number of samples in the other classes
(the inferior classes). The class with the largest number of
samples is referred to as the majority (superior or negative)
class, and a class with comparatively few
samples is referred to as a minority (inferior or
positive) class. Because the superior class has a large number of
training instances, classifiers show desirable accuracy
on that class, but the classification rate
drops when an inferior class is observed.
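As a minimal sketch of the terminology above, the class distribution of a labelled dataset can be summarized by counting the samples per class and comparing the largest class with the smallest. The label names, the sample counts, and the helper `class_distribution` are hypothetical and chosen only for illustration; the paper does not prescribe a specific imbalance measure, and the ratio used here (largest class size over smallest class size) is one common convention.

```python
from collections import Counter


def class_distribution(labels):
    """Return per-class counts, the majority class, and the imbalance ratio.

    The imbalance ratio here is an assumed convention:
    size of the largest class divided by size of the smallest class.
    """
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    minority_count = min(counts.values())
    return counts, majority_class, majority_count / minority_count


# Hypothetical multi-class dataset: one superior class and two inferior classes.
labels = ["normal"] * 950 + ["fault_a"] * 40 + ["fault_b"] * 10

counts, majority, ratio = class_distribution(labels)
print("samples per class:", dict(counts))
print("majority class:", majority)
print("imbalance ratio:", ratio)
```

In this toy dataset the majority class outnumbers the smallest minority class by a wide margin, which is the situation the section describes.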
Classification algorithms [3] perform poorly on imbalanced
datasets for the following reasons:
• The goal of any classification algorithm is to
minimize the overall error rate.
• They assume that the class distribution of the different class
labels is equal.
• Misclassification error rates of all the classes are
considered to be equal [4].
• Most data mining algorithms assume a uniform
distribution of records among all the classes.
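The effect of minimizing only the overall error rate can be illustrated with a small sketch. It assumes a hypothetical three-class dataset and a trivial predictor that always outputs the majority class; neither comes from the paper, and the helper `per_class_recall` is an illustrative name. The point is only that overall accuracy can look high while every minority class is missed.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Fraction of samples of each true class that were predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}


# Hypothetical imbalanced labels (assumed counts, for illustration only).
y_true = ["normal"] * 950 + ["fault_a"] * 40 + ["fault_b"] * 10

# A trivial predictor that always answers with the majority class.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)

print("overall accuracy:", accuracy)
print("per-class recall:", recall)
```

Here the overall accuracy is 0.95 although the recall of both minority classes is 0.0, so a per-class measure is needed to expose the bias towards the majority class.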