pISSN 2302-1616, eISSN 2580-2909
Vol 8, No. 1, June 2020, pp. 41-48
Available online http://journal.uin-alauddin.ac.id/index.php/biogenesis
DOI https://doi.org/10.24252/bio.v8i1.12002
Copyright © 2020. This is an open access article under the CC BY license (https://creativecommons.org/licenses/by/4.0/)

Performance Comparison of Data Sampling Techniques to Handle Imbalanced Class on Prediction of Compound-Protein Interaction

AKHMAD REZKI PURNAJAYA 1*, WISNU ANANTA KUSUMA 2, MEDRIA KUSUMA DEWI HARDHIENATA 3

1 Department of Software Engineering, Faculty of Computer, Universal University
Sungai Panas, Batam, Indonesia. 29444
*Email: rezki.purnajaya@uvers.ac.id
2 Tropical Biopharmaca Research Center, Faculty of Math and Science, IPB University
Jl. Taman Kencana, RT.03/RW.03, Bogor, West Java, Indonesia. 16128
3 Department of Computer Science, Faculty of Mathematics and Natural Science, IPB University
Jl. Meranti Wing 20 Level 5 Kampus IPB Darmaga, Bogor, Indonesia. 16680

Received 7 January 2020; Received in revised form 8 March 2020; Accepted 2 May 2020; Available online 30 June 2020

ABSTRACT
The prediction of Compound-Protein Interactions (CPI) is an essential step in drug-target analysis, both for developing new drugs and for drug repositioning. One challenge in this field is that non-interacting compound-protein pairs commonly outnumber interacting pairs. This class imbalance causes bias, which may degrade CPI prediction. Moreover, few CPI prediction studies have compared data sampling techniques for handling the class imbalance problem. To address this issue, we compare four data sampling techniques, namely Random Under-sampling (RUS), Combination of Over-Under-sampling (COUS), Synthetic Minority Over-sampling Technique (SMOTE), and Tomek Link (T-Link). Two benchmark CPI datasets, Nuclear Receptor and G-Protein Coupled Receptor (GPCR), are used to test these techniques.
The Area Under the Curve (AUC) is used to evaluate the CPI prediction performance of each technique. Results show that the AUC values for RUS, COUS, SMOTE, and T-Link are 0.75, 0.77, 0.85, and 0.79, respectively, on the Nuclear Receptor data and 0.70, 0.85, 0.91, and 0.72, respectively, on the GPCR data. SMOTE thus achieves the highest AUC on both datasets, indicating that it handles the class imbalance problem in CPI prediction better than the other three techniques.

Keywords: area under curve; compound-protein interaction; drug-target analysis; imbalanced class; SMOTE

INTRODUCTION
The identification of Compound-Protein Interactions (CPI) plays a key role in the development of drugs, particularly herbal medicines. Advances in molecular medicine and the Human Genome Project provide more opportunities to discover unknown associations in the CPI network. Newly discovered interactions can help in finding new drugs by screening candidate compounds and are also essential for understanding the causes of side effects in existing drugs (Mei et al., 2013; Hong et al., 2017). Recently, new computational models, including deep learning techniques, have been developed for predicting potential compound-protein interactions (Tsubaki et al., 2019). However, only a few studies are currently available on the interactions between compounds and proteins. For example, the PubChem and ChEMBL databases store 90 million drug candidate compound records, but information on the interactions of these compounds with protein targets is still limited (Wang et al., 2017; Mendez et al., 2019). Computational methods for predicting CPI are thus essential in drug and herbal medicine studies, as they can reduce the time, cost, and failure rate of discovering new drugs or herbal medicines (Kim et al., 2013).
To address the above issue, several studies on CPI prediction have been conducted by the Biopharmaca Research Center in Bogor, Indonesia. The Indonesia Jamu Herbs (IJAH) webserver was developed by the Biopharmaca Research Center to predict the efficacy of herbal or drug formulas for various diseases using the multicomponent-multitarget network