High-dimensional QSAR/QSPR classification modeling based on improving pigeon optimization algorithm

Zakariya Yahya Algamal a,*, Maimoonah Khalid Qasim b, Muhammad Hisyam Lee c, Haithem Taha Mohammad Ali d

a Department of Statistics and Informatics, University of Mosul, Mosul, Iraq
b Department of General Science, University of Mosul, Mosul, Iraq
c Department of Mathematical Sciences, Faculty of Science, Universiti Teknologi Malaysia, Johor, Malaysia
d College of Computers and Information Technology, Nawroz University, Kurdistan region, Iraq

ARTICLE INFO

Keywords: QSAR; Pigeon optimization algorithm; Evolutionary algorithm; Transfer function; Descriptors selection

ABSTRACT

High dimensionality is one of the major problems affecting the quality of quantitative structure-activity (property) relationship (QSAR/QSPR) classification methods in chemometrics. Applying variable selection is essential to improve the performance of the classification task. Variable selection is well known to be an NP-hard optimization problem, and various evolutionary algorithms in the literature are dedicated to solving it. Recently, a pigeon optimization algorithm was proposed, which has been successfully applied to various continuous optimization problems. In this paper, a new time-varying transfer function is proposed to improve the exploration and exploitation capability of the binary pigeon optimization algorithm in selecting the most relevant descriptors (variables) in QSAR/QSPR classification models with high classification accuracy and short computing time. Based on seven benchmark biopharmaceutical datasets, the experimental results show that the proposed time-varying transfer function achieves high classification accuracy while minimizing the number of selected descriptors and reducing the computational time.

1. Introduction

In chemometrics, the quantitative structure-activity (property) relationship (QSAR/QSPR) is a powerful and promising model used to better understand the relationship between the chemical activity (property) and the structure of chemical compounds by means of mathematical, statistical, and informatics methods [1–4]. A common task in these models is the selection of relevant descriptors (variables), where researchers try to determine the smallest possible set of descriptors that can still achieve good predictive performance [4–17]. Typical data in QSAR/QSPR modeling consist of a small sample of compounds (molecules) and a very large number of descriptors. Consequently, QSAR/QSPR modeling is challenged by the high dimensionality of the descriptors.

In chemometrics today, it is easy to generate thousands of molecular descriptors. For example, Dragon 7, which is commercial software, can calculate 5270 molecular descriptors [18,19]. In high-dimensional QSAR/QSPR modeling, where the number of descriptors, p, exceeds the number of compounds, n, the traditional statistical classification methods are not feasible [7,20]. In addition, a large number of descriptors can degrade the generalization performance of the classifier, that is, its prediction performance. Therefore, selecting the descriptors that truly affect the biological activity is an attractive approach in QSAR/QSPR modeling [21].

Variable (descriptor) selection is a non-deterministic polynomial-time (NP) hard problem. Its objective is to provide faster and more effective models, and also to avoid overfitting and the curse of dimensionality. Variable selection is a typical combinatorial optimization problem, and considerable effort has been devoted to developing variable selection procedures.
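To make the combinatorial nature of descriptor selection concrete, the following minimal Python sketch encodes a candidate subset as a binary mask over p descriptors and searches all non-empty masks exhaustively. The function names (`count_subsets`, `toy_score`, `exhaustive_search`) and the toy scoring rule are illustrative assumptions, not part of the method proposed in this paper. In practice the score would be a classifier's accuracy on the selected descriptors.

```python
from itertools import product


def count_subsets(p):
    """Number of non-empty descriptor subsets that exist for p descriptors."""
    return 2 ** p - 1


def toy_score(mask, relevant=(0, 2), penalty=0.1):
    """Hypothetical fitness: reward selecting the 'relevant' descriptors
    and penalize every selected descriptor. Requires len(mask) >= 3."""
    hits = sum(mask[j] for j in relevant)
    return hits - penalty * sum(mask)


def exhaustive_search(p, score):
    """Evaluate every non-empty binary mask of length p and keep the best.
    The cost grows as 2**p, so this is only feasible for very small p."""
    best_mask, best_value = None, float("-inf")
    for mask in product((0, 1), repeat=p):
        if not any(mask):
            continue  # skip the empty subset
        value = score(mask)
        if value > best_value:
            best_mask, best_value = mask, value
    return best_mask, best_value


if __name__ == "__main__":
    print(count_subsets(4))                    # 15 candidate subsets
    print(exhaustive_search(4, toy_score)[0])  # (1, 0, 1, 0)
    # For the 5270 descriptors mentioned above, the count has more
    # than 1500 decimal digits, which rules out exhaustive search.
    print(len(str(count_subsets(5270))))
```

Because the number of candidate subsets doubles with every added descriptor, exhaustive enumeration is only usable for a handful of descriptors, which is why heuristic search over binary masks is used instead.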
* Corresponding author.
E-mail addresses: zakariya.algamal@uomosul.edu.iq (Z.Y. Algamal), maimoonah.qasim@uomosul.edu.iq (M.K. Qasim), mhl@utm.my (M.H. Lee), haithem.alyousif@nawroz.edu.krd (H.T. Mohammad Ali).
Chemometrics and Intelligent Laboratory Systems 206 (2020) 104170
https://doi.org/10.1016/j.chemolab.2020.104170
Received 9 July 2020; Received in revised form 8 September 2020; Accepted 26 September 2020; Available online 28 September 2020
0169-7439/© 2020 Elsevier B.V. All rights reserved.

With the development of computational intelligence, evolutionary algorithms, such as particle swarm optimization (PSO) [22], the bat algorithm (BA) [23], and grey wolf optimization (GWO) [24], are the most effective core technologies for handling high-dimensional data. The pigeon optimization algorithm (POA), which was proposed by Duan and Qiao [25], has certain outstanding merits, such as a simple computational process, simple implementation, and easy understanding