ANALYZING THE IMPACT OF RESAMPLING METHOD FOR IMBALANCED DATA TEXT IN INDONESIAN SCIENTIFIC ARTICLES CATEGORIZATION

Ariani Indrawati 1*, Hendro Subagyo 2, Andre Sihombing 3, Wagiyah 4, Sjaeful Afandi 5
1,2,3,4,5 Indonesian Institute of Science
* Correspondence: indrawati.ariani@gmail.com

ABSTRACT
Extremely skewed data in artificial intelligence, machine learning, and data mining often produce misleading results, because machine learning algorithms are designed to work best with balanced data. In practice, however, imbalanced data are common. The most popular technique for handling imbalanced data is resampling the dataset, which changes the number of instances in the majority and minority classes until the classes are balanced. Many resampling techniques (oversampling, undersampling, or a combination of both) have been proposed, and new ones continue to appear. Resampling may increase or decrease classifier performance. Comparative research on resampling methods for structured data has been widely carried out, but studies that compare resampling methods on unstructured data are rare. This raises many questions, one of which is whether these methods can be applied to unstructured data such as text, which has high dimensionality and very diverse characteristics. To understand how different resampling techniques affect classifier learning on imbalanced text data, we perform an experimental analysis using various resampling methods with several classification algorithms to classify articles in the Indonesian Scientific Journal Database (ISJD). The experiment shows that resampling imbalanced text data generally improves classifier performance, but the improvement is not significant because text data are very diverse and high-dimensional.
ABSTRAK
Imbalanced datasets used in artificial intelligence, machine learning, and data mining often give misleading results. This is because machine learning algorithms are designed to work optimally with balanced data. However, we are often required to analyze data using imbalanced datasets. The most common way to handle data imbalance is resampling, which changes the number of instances in the majority or minority class to form a balanced dataset. Several resampling techniques have been proposed: oversampling, undersampling, and combinations of the two. These techniques may increase or decrease the performance of a classification model. Resampling of structured data has been applied in many studies, but resampling of unstructured data has rarely been done. This raises the question of whether resampling can be applied to unstructured data such as text, which has many dimensions and very diverse characteristics. In this study we apply resampling techniques to the article dataset of the Indonesian Scientific Journal Database (ISJD) to understand their effect on several classification models. The experimental results show that resampling generally improves the performance of classification models, but the improvement is not significant.

Keywords: Imbalanced data; Resampling techniques; Machine learning; Classification; Journal; ISJD

1. INTRODUCTION
The problem of imbalanced data has become an increasingly active topic in recent years. Imbalanced data is the condition in which the number of instances in one class is significantly lower than in the other classes. It is a challenging problem in artificial intelligence, machine learning, and data mining.
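To make the definition above concrete, the following minimal sketch computes empirical class priors for a toy corpus and balances it with random oversampling and random undersampling. Random resampling is used here only because it is the simplest member of each family; it is not necessarily one of the methods compared in this study. The function names (`class_priors`, `random_resample`) and the toy data are hypothetical and chosen for illustration.

```python
import random
from collections import Counter


def class_priors(labels):
    """Return the empirical prior probability of each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


def random_resample(texts, labels, mode="over", seed=0):
    """Balance classes by random over- or undersampling (illustrative only).

    mode="over":  add random duplicates until every class matches the largest.
    mode="under": keep a random subset so every class matches the smallest.
    """
    if mode not in ("over", "under"):
        raise ValueError("mode must be 'over' or 'under'")
    rng = random.Random(seed)

    # Group the documents by class label.
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)

    sizes = [len(items) for items in by_class.values()]
    target = max(sizes) if mode == "over" else min(sizes)

    out_texts, out_labels = [], []
    for label, items in by_class.items():
        if len(items) >= target:
            # Sample without replacement (drops documents when undersampling).
            chosen = rng.sample(items, target)
        else:
            # Keep every original, then add random duplicates.
            chosen = items + rng.choices(items, k=target - len(items))
        out_texts.extend(chosen)
        out_labels.extend([label] * target)
    return out_texts, out_labels


# Hypothetical toy corpus: 8 "physics" titles versus 2 "law" titles.
texts = [f"physics article {i}" for i in range(8)] + ["law article 0", "law article 1"]
labels = ["physics"] * 8 + ["law"] * 2

print(class_priors(labels))
over_texts, over_labels = random_resample(texts, labels, mode="over")
under_texts, under_labels = random_resample(texts, labels, mode="under")
print(Counter(over_labels))
print(Counter(under_labels))
```

In this sketch, oversampling adds no new information (it only repeats minority documents), while undersampling discards majority documents. This trade-off is one reason resampling can either raise or lower classifier performance.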
Most machine learning algorithms are designed to work best with balanced data, in which the target classes have similar prior probabilities. In real situations, however, the ratios of prior probabilities between classes are often extremely skewed, and the data are high-dimensional and extremely sparse.

Submission: 02-06-2020; Review: 30-08-2020; Accepted: 07-09-2020; Revised: 30-10-2020
ISSN 0125-9008 (Print); ISSN 2301-8593 (Online)
DOI: https://dx.doi.org/10.14203/j.baca.v41i2.563
SK Dirjen Risbang - Kemristekdikti No 21/E/KPT/2018 (SINTA Rank 2)