High-dimensional QSAR/QSPR classification modeling based on improving
pigeon optimization algorithm
Zakariya Yahya Algamal a,*, Maimoonah Khalid Qasim b, Muhammad Hisyam Lee c, Haithem Taha Mohammad Ali d
a Department of Statistics and Informatics, University of Mosul, Mosul, Iraq
b Department of General Science, University of Mosul, Mosul, Iraq
c Department of Mathematical Sciences, Faculty of Science, Universiti Teknologi Malaysia, Johor, Malaysia
d College of Computers and Information Technology, Nawroz University, Kurdistan region, Iraq
ARTICLE INFO
Keywords:
QSAR
Pigeon optimization algorithm
Evolutionary algorithm
Transfer function
Descriptors selection
ABSTRACT
High dimensionality is one of the major problems affecting the quality of quantitative structure-activity
(property) relationship (QSAR/QSPR) classification methods in chemometrics. Applying variable selection is
essential to improve the performance of the classification task. Variable selection is well known to be an NP-hard
optimization problem. Various evolutionary algorithms have been proposed in the literature to solve this problem.
Recently, a pigeon optimization algorithm was proposed, which has been successfully applied to solve various
continuous optimization problems. In this paper, a new time-varying transfer function is proposed to improve the
exploration and exploitation capability of the binary pigeon optimization algorithm in selecting the most relevant
descriptors (variables) in QSAR/QSPR classification models with high classification accuracy and short computing
time. Based on seven benchmark biopharmaceutical datasets, the experimental results reveal the capability of the
proposed time-varying transfer function to achieve high classification accuracy while minimizing the number of
selected descriptors and reducing the computational time.
1. Introduction
In chemometrics, the quantitative structure-activity (property)
relationship (QSAR/QSPR) is a powerful and promising modeling approach
used to better understand the relationship between the structure of
chemical compounds and their chemical activity (property) by means of
mathematical, statistical, and informatics methods [1–4]. A common
task in these models is the selection of relevant descriptors (variables),
where researchers try to determine the smallest possible set of descriptors
that can still achieve good predictive performance [4–17]. A typical
dataset in QSAR/QSPR modeling consists of a small sample of compounds
(molecules) and a very large number of descriptors. Consequently,
QSAR/QSPR modeling is challenged by the high dimensionality of the
descriptors.
In chemometrics today, thousands of molecular descriptors can easily be
generated. For example, Dragon 7, a commercial software package, can
calculate 5270 molecular descriptors [18,19]. In high-dimensional
QSAR/QSPR modeling, where the number of descriptors, p, exceeds the
number of compounds, n, traditional statistical classification methods
are not feasible [7,20]. In addition, a large number of descriptors can
degrade the generalization or prediction performance of the classifier.
Therefore, selecting the descriptors that truly affect the biological
activity is an attractive approach in QSAR/QSPR modeling [21].
Variable (descriptor) selection is known to be a nondeterministic
polynomial-time (NP) hard problem. The objective of variable selection is
to provide faster and more effective models and to avoid overfitting and
the curse of dimensionality. Variable selection is a typical combinatorial
optimization problem, and considerable effort has been devoted to
developing variable selection procedures. With the development of
computational intelligence, evolutionary algorithms, such as particle
swarm optimization (PSO) [22], the bat algorithm (BA) [23], and grey wolf
optimization (GWO) [24], have become one of the most effective core
technologies for addressing high-dimensional data.
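To make the combinatorial nature of variable selection concrete, the following minimal Python sketch (an illustration, not part of the original study) counts the candidate descriptor subsets for p descriptors and encodes one candidate subset as a binary mask, the kind of representation that binary metaheuristics typically search over. The function names and the mask encoding are hypothetical; the value p = 5270 is the Dragon 7 descriptor count quoted in the text.

```python
import math
import random
from typing import List


def count_subsets(p: int) -> int:
    """Number of non-empty descriptor subsets for p descriptors: 2**p - 1."""
    return 2 ** p - 1


def count_subsets_of_size(p: int, k: int) -> int:
    """Number of subsets that contain exactly k of the p descriptors."""
    return math.comb(p, k)


def random_mask(p: int, rng: random.Random) -> List[int]:
    """Hypothetical encoding: mask[j] == 1 means descriptor j is selected."""
    return [rng.randint(0, 1) for _ in range(p)]


def selected_indices(mask: List[int]) -> List[int]:
    """Indices of the descriptors kept by a binary mask."""
    return [j for j, bit in enumerate(mask) if bit == 1]


if __name__ == "__main__":
    p = 5270  # descriptor count quoted for Dragon 7 in the text
    total = count_subsets(p)
    # Report only the digit count: the number itself is far too large to read.
    print(f"non-empty subsets for p={p}: {len(str(total))} decimal digits")
    print(f"subsets with exactly 10 descriptors: {count_subsets_of_size(p, 10):.3e}")

    # One candidate solution, as a metaheuristic might represent it.
    mask = random_mask(p, random.Random(0))
    print(f"a random mask selects {len(selected_indices(mask))} descriptors")
```

Because the number of subsets doubles with every additional descriptor, exhaustive evaluation is out of reach even for moderate p, which is the usual motivation for heuristic search over binary masks.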
The pigeon optimization algorithm (POA), which was proposed by
Duan and Qiao [25], has certain outstanding merits, such as a simple
computational process, simple implementation, and easy understanding
* Corresponding author.
E-mail addresses: zakariya.algamal@uomosul.edu.iq (Z.Y. Algamal), maimoonah.qasim@uomosul.edu.iq (M.K. Qasim), mhl@utm.my (M.H. Lee), haithem.alyousif@nawroz.edu.krd (H.T. Mohammad Ali).
Chemometrics and Intelligent Laboratory Systems
https://doi.org/10.1016/j.chemolab.2020.104170
Received 9 July 2020; Received in revised form 8 September 2020; Accepted 26 September 2020
Available online 28 September 2020
0169-7439/© 2020 Elsevier B.V. All rights reserved.
Chemometrics and Intelligent Laboratory Systems 206 (2020) 104170