© IJARW | ISSN (O) - 2582-1008 July 2024 | Vol. 6 Issue. 1 www.ijarw.com IJARW2190 International Journal of All Research Writings

OVERVIEW ON STATE-OF-THE-ART DEEP LEARNING-BASED MODELS FOR IMBALANCE CLASSIFICATION

Trang Phung T. Thu 1, Duong Ngoc Khang 2
Faculty of Basic Science, School of Foreign Languages, Thai Nguyen University, Thai Nguyen, Vietnam

ABSTRACT

In recent years, data imbalance has become a major challenge that significantly impacts the process of mining information from data. It arises when some classes contain far more samples than others. With the development of deep learning, there have been significant advances in representing and understanding information from images. However, when deep learning is applied to practical image recognition tasks, the "deep long-tail" problem becomes apparent. Training models to handle even rare cases helps create robust, flexible models that adapt well to real-world data fluctuations. This paper aims to comprehensively analyze the long-tail problem in image recognition, summarize the strengths and limitations of previous methods, and offer a view on future research directions.

Keywords: Deep learning, Imbalance classification, Long-tailed classification

1. INTRODUCTION

The advent of deep neural networks has led to notable breakthroughs in many fields such as computer vision [1, 2], speech recognition [3, 4], and natural language processing [5]. However, deep learning models learn features from large amounts of data and therefore rely heavily on it, so they face challenges arising from problems in the data itself. Imbalanced classification is a type of classification problem in machine learning in which the numbers of data samples across classes are unequal; the resulting distribution is often called a long-tailed distribution. In reality, class sample counts are rarely balanced.
Specifically, some classes have far more samples than others. The few classes that account for most of the data are called head classes, while the many classes with very few samples are called tail classes (illustrated in Figure 1).

Long-tailed distributions appear in many domains. For example, in economics, where long-tail theory first appeared, it was used to distinguish between red-ocean and blue-ocean strategies in the marketplace. In retail, a few "best-selling products" achieve high sales volume and belong to the head, while niche goods come in many varieties but each sells little, belonging to the tail. In image recognition, many tasks also involve long-tail problems, such as segmentation and scene classification.

Figure 1. Distribution of long-tail datasets. For example, dog and budgie are common classes, while eagle and boar are uncommon classes.

Chris Anderson [6], who first proposed the long-tail theory, argues that the future of business and culture lies not in popular products but in the infinite road ahead. This highlights the importance of research on long-tailed classes. From this perspective, we should not focus only on head classes but should also pay attention to tail classes in data research.
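To make the head/tail structure concrete: long-tailed benchmarks are commonly built by decaying the per-class sample count exponentially from the head class to the tail class, so that the ratio between the largest and smallest class equals a chosen imbalance factor. The following is a minimal, illustrative sketch of that protocol; the function name and parameter values are our own choices, not from any specific benchmark.

```python
def long_tailed_counts(num_classes=10, max_count=1000, imbalance_ratio=100):
    """Per-class sample counts under exponential decay.

    The head class (index 0) gets max_count samples; the tail class
    (index num_classes - 1) gets max_count / imbalance_ratio samples.
    """
    # Decay factor chosen so that mu ** (num_classes - 1) == 1 / imbalance_ratio
    mu = imbalance_ratio ** (-1.0 / (num_classes - 1))
    return [round(max_count * mu ** i) for i in range(num_classes)]

counts = long_tailed_counts()
print(counts)                      # head class: 1000 samples, tail class: 10
print(max(counts) / min(counts))   # overall imbalance ratio: 100.0
```

With these settings, a handful of head classes contribute the bulk of the training data while the tail classes contribute only a few samples each, which is exactly the regime that degrades standard deep classifiers.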