Exploring the impact of size of training sets for the development of predictive QSAR models

Partha Pratim Roy a, J. Thomas Leonard b, Kunal Roy a,*

a Drug Theoretics and Cheminformatics Lab, Division of Medicinal and Pharmaceutical Chemistry, Department of Pharmaceutical Technology, Jadavpur University, Kolkata 700 032, India
b Department of Pharmaceutical Chemistry, KM College of Pharmacy, Madurai 625 107, India

Received 8 December 2006; received in revised form 26 July 2007; accepted 31 July 2007
Available online 7 August 2007

Abstract

While building a predictive quantitative structure-activity relationship (QSAR), validation of the developed model is a very important task. However, because a truly new data set is often unavailable for checking the predictability and robustness of the developed model, external validation in QSAR studies is commonly performed by splitting the available data into training and test sets. In the present work we have attempted to explore the impact of training set size on the quality of prediction using different topological descriptors and three different statistical techniques. Three different data sets of moderate size have been used for the present study: cytoprotection data of anti-HIV thiocarbamates (n = 62), HIV reverse transcriptase inhibition data of 1-[(2-hydroxyethoxy)methyl]-6-(phenylthio)thymine (HEPT) derivatives (n = 107) and bioconcentration factor data of diverse functional compounds (n = 122). In each case, the data set was divided into different combinations of training and test sets, maintaining different size ratios over several iterations. For the first two data sets, reducing the training set size had a significant impact on the predictive ability of the models, with the first data set showing a higher dependence on size than the second. However, in modeling the bioconcentration factor, no significant impact of training set size on the quality of prediction could be found.
Hence, no general rule can be formulated regarding the impact of training set size on the quality of prediction. The optimum size of the training set should be set based on the particular data set and the types of descriptors and statistical analysis being used.
© 2007 Elsevier B.V. All rights reserved.

Keywords: QSAR; Validation; Training set size; K-means clusters; Stepwise regression; FA-MLR; PLS

1. Introduction

Similar molecules with just a slight variation in their structures can exhibit either different magnitudes of a particular biological activity or quite different types of biological activity. This relationship between molecular structure and changes in biological activity, established on a quantitative basis, is the central focus of the field of quantitative structure-activity relationships (QSAR). The main objective in QSAR is to investigate these relationships by building mathematical models that explain them statistically. Apart from drug discovery and lead optimization [3], QSARs are being applied in many disciplines such as risk assessment, toxicity prediction, and regulatory decisions [1,2]. QSAR models are useful for various purposes, including the prediction of activities of untested chemicals. The success of drug discovery efforts within the pharmaceutical industry depends heavily on the utilization of SAR techniques for these and related purposes. A QSAR model's utility and, in the case of regulatory decisions, the justification for its use increasingly depend on the ability to quantify the model's potential for predicting unknown chemicals with some known degree of certainty [4]. Over the years, many methods, algorithms and techniques have been developed and applied in QSAR studies [5]. The challenge, therefore, is to select the group of descriptors that describe the most critical structural and physicochemical features associated with activity.
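To make the experimental design summarized in the abstract more concrete, the following is a minimal sketch of a training set size experiment. It is an illustration under stated assumptions, not the procedure used in this work: the data are synthetic, the model is ordinary multiple linear regression, and the splits are purely random (the keywords suggest that cluster-based selection was used here, so random splitting is a simplification swapped in for brevity). The function names `r2_pred` and `split_size_experiment`, the chosen training fractions, and the form of the predictive R² (test-set residuals scaled by deviations from the training-set mean) are all assumptions made for this example.

```python
import numpy as np


def r2_pred(y_test, y_hat, y_train_mean):
    """Predictive R^2: 1 - SS_res / SS_tot, where SS_tot is taken around
    the training-set mean (an assumed, commonly used definition)."""
    ss_res = np.sum((y_test - y_hat) ** 2)
    ss_tot = np.sum((y_test - y_train_mean) ** 2)
    return 1.0 - ss_res / ss_tot


def split_size_experiment(X, y, train_fractions, n_iter=20, seed=0):
    """For each training fraction, repeat a random train/test split n_iter
    times, fit a linear model on the training part, and summarize the
    predictive R^2 on the test part as (mean, standard deviation)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    results = {}
    for frac in train_fractions:
        n_train = int(round(frac * n))
        scores = []
        for _ in range(n_iter):
            perm = rng.permutation(n)
            train_idx, test_idx = perm[:n_train], perm[n_train:]
            # Design matrices with an intercept column.
            A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
            A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
            # Ordinary least-squares fit on the training set only.
            coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
            y_hat = A_test @ coef
            scores.append(r2_pred(y[test_idx], y_hat, y[train_idx].mean()))
        results[frac] = (float(np.mean(scores)), float(np.std(scores)))
    return results


# Synthetic stand-in data: 62 "compounds" with 3 hypothetical descriptors.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(62, 3))
y = X @ np.array([1.0, -0.5, 0.25]) + data_rng.normal(scale=0.3, size=62)

results = split_size_experiment(X, y, train_fractions=[0.25, 0.5, 0.75])
for frac, (mean_r2, sd_r2) in results.items():
    print(f"train fraction {frac:.2f}: mean R2_pred = {mean_r2:.3f} (sd {sd_r2:.3f})")
```

Comparing the mean and spread of the predictive R² across training fractions gives a rough picture of how sensitive a given data set and model are to training set size; on real data the outcome would depend on the descriptors and statistical method chosen.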
Effective descriptor or variable selection is an integral part of the QSAR modeling process [6]. Obtaining a good quality QSAR model depends on many factors, such as the quality of biological data, the choice of descriptors and statistical

Available online at www.sciencedirect.com
Chemometrics and Intelligent Laboratory Systems 90 (2008) 31-42
www.elsevier.com/locate/chemolab
* Corresponding author. Fax: +91 33 2837 1078.
E-mail address: kunalroy_in@yahoo.com (K. Roy).
URL: http://www.geocities.com/kunalroy_in (K. Roy).
doi:10.1016/j.chemolab.2007.07.004