© IJARW | ISSN (O) - 2582-1008
July 2024 | Vol. 6 Issue. 1
www.ijarw.com
IJARW2190 International Journal of All Research Writings 14
OVERVIEW OF STATE-OF-THE-ART DEEP LEARNING-BASED
MODELS FOR IMBALANCED CLASSIFICATION
Trang Phung T. Thu¹, Duong Ngoc Khang²
Faculty of Basic Science, School of Foreign Languages, Thai Nguyen University, Thai Nguyen, Vietnam
ABSTRACT
In recent years, data imbalance has become a major challenge that significantly impacts the process
of mining information from data. It arises when some classes contain significantly more samples than
others. With the development of deep learning, there have been significant advances in representing
and understanding information from images. However, when applying deep learning to practical image
recognition tasks, the "deep long-tail" problem becomes apparent. Training models to handle even rare
cases produces robust, flexible models that adapt well to real-world data fluctuations. This paper
aims to comprehensively analyze the long-tail problem in image recognition, summarize the strengths
and limitations of previous methods, and provide a view on future research directions.
Keywords: Deep learning, Imbalanced classification, Long-tailed classification
1. INTRODUCTION
The advent of deep neural networks has led to
notable breakthroughs in many fields such as
computer vision [1, 2], speech recognition [3, 4],
natural language processing [5], etc. However,
deep learning models learn features from large
amounts of data and therefore depend heavily on
its quality; problems in the data itself thus
translate directly into challenges for deep
learning.
Imbalanced classification is a type of classification
problem in machine learning in which the number of
data samples per class is unbalanced; such a
distribution is called a long-tailed distribution.
In reality, the number of data samples per class is
rarely balanced: one class may have a much larger
number of samples than another. The few classes
that account for most of the data are the head
classes, while the many classes with very few data
samples are the tail classes (illustrated in
Figure 1). Long-tailed distributions are visible in
many domains. For example, in economics, where
long-tail theory first appeared, it was used to
distinguish between red-ocean and blue-ocean
strategies in the marketplace. In the sales
business, a few "best-selling products" achieve
high sales volume and belong to the head class,
while the many other goods each sell in low
volumes and belong to the tail class. In image
recognition, many tasks likewise involve long-tail
problems, such as segmentation, scene
classification, etc.
Figure 1. Distribution of a long-tailed dataset. For
example, dog and budgie are common (head) classes,
while eagle and boar are uncommon (tail) classes.
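The head-versus-tail structure described above can be made concrete with a small sketch. The snippet below is not from the paper; it generates per-class sample counts that decay exponentially from head to tail, parameterized by an imbalance factor (the ratio of the largest to the smallest class), which is a common way long-tailed benchmarks are constructed. The helper name `long_tail_counts` is purely illustrative.

```python
import numpy as np

def long_tail_counts(num_classes: int, max_count: int,
                     imbalance_factor: float) -> np.ndarray:
    """Per-class sample counts decaying exponentially from head to tail.

    imbalance_factor is the ratio max_count / min_count, so class 0
    (the head) gets max_count samples and the last class (the tail)
    gets max_count / imbalance_factor samples.
    """
    idx = np.arange(num_classes)
    counts = max_count * (1.0 / imbalance_factor) ** (idx / (num_classes - 1))
    return counts.astype(int)

counts = long_tail_counts(num_classes=10, max_count=5000, imbalance_factor=100)
print(counts)  # head class: 5000 samples, tail class: 50 samples
```

A handful of head classes thus dominate the total sample count, while the tail classes contribute only a few examples each, which is exactly the regime in which standard training over-fits the head and under-represents the tail.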
Chris Anderson [6], who first proposed the
long-tail theory, argues that the future of
business and culture lies not in popular hit
products but in the long tail of less popular ones.
This underscores the importance of research on
long-tailed classes: we should focus not only on
the head classes but also on the tail classes in
data-driven research.