pISSN 2302-1616, eISSN 2580-2909
Vol 8, No. 1, June 2020, pp. 41-48
Available online http://journal.uin-alauddin.ac.id/index.php/biogenesis
DOI https://doi.org/10.24252/bio.v8i1.12002
Copyright © 2020. This is an open access article under the CC BY license (https://creativecommons.org/licenses/by/4.0/ )
Performance Comparison of Data Sampling Techniques to Handle Imbalanced
Class on Prediction of Compound-Protein Interaction
AKHMAD REZKI PURNAJAYA1*, WISNU ANANTA KUSUMA2, MEDRIA KUSUMA DEWI HARDHIENATA3
1Department of Software Engineering, Faculty of Computer, Universal University
Sungai Panas, Batam, Indonesia. 29444
*Email: rezki.purnajaya@uvers.ac.id
2Tropical Biopharmaca Research Center, Faculty of Math and Science, IPB University
Jl. Taman Kencana, RT.03/RW.03, Bogor, West Java, Indonesia. 16128
3Department of Computer Science, Faculty of Mathematics and Natural Science, IPB University
Jl. Meranti Wing 20 Level 5 Kampus IPB Darmaga, Bogor, Indonesia. 16680
Received 7 January 2020; Received in revised form 8 March 2020;
Accepted 2 May 2020; Available online 30 June 2020
ABSTRACT
The prediction of Compound-Protein Interactions (CPI) is an essential step in drug-target analysis
for developing new drugs as well as for drug repositioning. One challenge in this field is that
non-interacting compound-protein pairs typically outnumber interacting pairs. This class imbalance
introduces bias, which may degrade CPI prediction. Moreover, few studies on CPI prediction have
compared data sampling techniques for handling the class imbalance problem.
To address this issue, we compare four data sampling techniques: Random Under-sampling (RUS),
Combination of Over-Under-sampling (COUS), Synthetic Minority Over-sampling Technique (SMOTE),
and Tomek Link (T-Link). Two benchmark CPI datasets, Nuclear Receptor and G-Protein Coupled Receptor
(GPCR), are used to test these techniques. Area Under Curve (AUC) is used to evaluate the CPI prediction
performance of each technique. Results show that the AUC values for RUS, COUS, SMOTE, and T-Link
are 0.75, 0.77, 0.85, and 0.79, respectively, on the Nuclear Receptor data and 0.70, 0.85, 0.91, and 0.72,
respectively, on the GPCR data. These results indicate that SMOTE achieves the highest AUC values on
both datasets and is therefore more capable of handling the class imbalance problem in CPI
prediction than the other three techniques.
Keywords: area under curve; compound-protein interaction; drug-target analysis; imbalanced class;
SMOTE
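As a rough illustration of two of the techniques named above and of the evaluation metric, the following minimal sketch shows random under-sampling, a SMOTE-style interpolation between minority samples, and a rank-based AUC. It is not the implementation used in this study: the function names (`random_under_sample`, `smote_like`, `auc`), the toy two-dimensional data, and the parameter values are all assumptions made for illustration only.

```python
import random


def random_under_sample(majority, minority, rng):
    """RUS sketch: keep all minority samples, draw an equal-sized random subset of the majority."""
    return rng.sample(majority, len(minority)), list(minority)


def smote_like(minority, n_new, k, rng):
    """SMOTE-style sketch: each synthetic point lies on the segment between a
    minority sample and one of its k nearest minority neighbours."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: dist2(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([x + gap * (y - x) for x, y in zip(base, other)])
    return synthetic


def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy imbalanced data (hypothetical): 50 non-interacting vs 5 interacting pairs.
rng = random.Random(0)
majority = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
minority = [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(5)]

maj_rus, min_rus = random_under_sample(majority, minority, rng)
synthetic = smote_like(minority, n_new=45, k=3, rng=rng)
balanced_minority = minority + synthetic
```

In this sketch, under-sampling balances the classes by discarding majority samples, whereas the SMOTE-style step balances them by adding interpolated minority samples, so no majority data is discarded.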
INTRODUCTION
The identification of Compound-Protein
Interaction (CPI) plays a key role in the
development of drugs, particularly herbal
medicines. The great advances in molecular
medicine and the human genome project
provide more opportunities to discover
unknown associations in the CPI network.
Newly discovered interactions can help
find new drugs by screening candidate
compounds and are also essential for
understanding the causes of side effects in
existing drugs (Mei et al., 2013; Hong et al.,
2017). Recently, computational models,
including deep learning techniques, have been
developed for predicting potential
compound-protein interactions (Tsubaki et al., 2019).
However, few studies are currently
available on the interactions between
compounds and proteins. For example, the
PubChem and ChEMBL databases store 90
million drug candidate compound records, but
the known interactions between these
compounds and protein targets are still limited
(Wang et al., 2017; Mendez et al., 2019).
Computational methods for predicting CPI are
thus essential in drug and herbal medicine
studies, as they can reduce the time, cost, and
failure rate of discovering new drugs or herbal
medicines (Kim et al., 2013).
To address the above issue, several studies
on CPI prediction have been conducted by the
Biopharmaca Research Center in Bogor,
Indonesia. The Indonesia Jamu Herbs (IJAH)
webserver was developed by the Biopharmaca
Research Center to predict the efficacy of
herbal or drug formulas for various diseases
using the multicomponent-multitarget network