Exploring the impact of size of training sets for the development
of predictive QSAR models
Partha Pratim Roy a, J. Thomas Leonard b, Kunal Roy a,⁎
a Drug Theoretics and Cheminformatics Lab, Division of Medicinal and Pharmaceutical Chemistry, Department of Pharmaceutical Technology, Jadavpur University, Kolkata 700 032, India
b Department of Pharmaceutical Chemistry, KM College of Pharmacy, Madurai 625 107, India
Received 8 December 2006; received in revised form 26 July 2007; accepted 31 July 2007
Available online 7 August 2007
Abstract
When building a predictive quantitative structure–activity relationship (QSAR) model, validation of the developed model is a very important task.
However, because a truly new set of data is often unavailable for checking the predictability and robustness of the developed model, external
validation in QSAR studies is commonly performed by splitting the available data into training and test sets. In the present work we have
attempted to explore the impact of training set size on the quality of prediction using different topological descriptors and three different statistical
techniques. Three different data sets of moderate size have been used for the present study: cytoprotection data of anti-HIV thiocarbamates
(n = 62), HIV reverse transcriptase inhibition data of 1-[(2-hydroxyethoxy)methyl]-6-(phenylthio)thymine (HEPT) derivatives (n = 107) and
bioconcentration factor data of diverse functional compounds (n = 122). In each case, the data set was divided into different combinations of
training and test sets with different size ratios over several iterations. For the first two data sets, reducing the training set size
had a significant impact on the predictive ability of the models, with the first data set showing a higher dependence on size than the second
one. However, in the modeling of bioconcentration factor, no significant impact of training set size on the quality of prediction was found.
Hence, no general rule can be formulated regarding the impact of training set size on the quality of prediction. The optimum size of the training set
should be chosen based on the particular data set, the types of descriptors and the statistical analysis being used.
© 2007 Elsevier B.V. All rights reserved.
Keywords: QSAR; Validation; Training set size; K-means clusters; Stepwise regression; FA-MLR; PLS
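The splitting protocol summarized in the abstract can be illustrated with a minimal sketch. It is not the procedure used in this work: the data are synthetic, a single hypothetical descriptor with ordinary least squares stands in for the topological descriptors and the statistical techniques named above, and plain random shuffling stands in for whatever selection scheme is used to form the training sets. The external predictive R² is computed with a commonly used definition (one minus the test-set sum of squared prediction errors divided by the test-set sum of squares about the training-set mean), which is an assumption here. The names fit_line, r2_pred and split_study are illustrative only.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x for a single descriptor."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def r2_pred(train, test):
    """External predictive R^2 (assumed definition, see the text above)."""
    xs, ys = zip(*train)
    a, b = fit_line(xs, ys)
    y_train_mean = statistics.fmean(ys)
    press = sum((y - (a + b * x)) ** 2 for x, y in test)
    ss = sum((y - y_train_mean) ** 2 for _, y in test)
    return 1.0 - press / ss


def split_study(data, train_fractions, n_iter, seed=0):
    """Mean external R^2 over n_iter random splits for each training fraction."""
    rng = random.Random(seed)
    results = {}
    for frac in train_fractions:
        n_train = round(frac * len(data))
        scores = []
        for _ in range(n_iter):
            shuffled = data[:]
            rng.shuffle(shuffled)  # random split: a simplification
            scores.append(r2_pred(shuffled[:n_train], shuffled[n_train:]))
        results[frac] = statistics.fmean(scores)
    return results


# Synthetic (descriptor, activity) pairs; NOT data from this study.
_rng = random.Random(42)
data = []
for _ in range(62):
    x = _rng.uniform(0.0, 10.0)
    data.append((x, 2.0 + 0.8 * x + _rng.gauss(0.0, 0.5)))

fractions = [0.75, 0.50, 0.25]
results = split_study(data, fractions, n_iter=20)
for frac in fractions:
    print(f"training fraction {frac:.2f}: mean R2_pred = {results[frac]:.3f}")
```

Comparing the mean values across the training fractions gives a rough picture of how sensitive a given data set and model are to training set size. With real data, the spread across iterations would be worth reporting alongside the mean.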
1. Introduction
Similar molecules with just a slight variation in their structures
can exhibit either different magnitudes of a particular biological
activity or quite different types of biological activities. The
relationship between molecular structure and changes in
biological activity, expressed on a quantitative basis, is the central
focus of the field of quantitative structure–activity relationships
(QSAR). The main objective in QSAR is to investigate these
relationships by building mathematical models that explain them
statistically. QSARs are applied in many disciplines, such as risk
assessment, toxicity prediction, and regulatory decisions [1,2],
in addition to drug discovery and lead optimization [3]. The QSAR models are
useful for various purposes including the prediction of activities
of untested chemicals. The success of drug discovery efforts
within the pharmaceutical industry depends heavily on utilization
of SAR techniques for these and related purposes. A QSAR
model's utility and, in the case of regulatory decisions,
justification for usage increasingly depend on the ability to
quantify a model's potential for predicting unknown chemicals
with some known degree of certainty [4]. Over the years, many
methods, algorithms and techniques have been developed and
applied in QSAR studies [5]. The challenge,
therefore, is to select the group of descriptors that describe the
most critical structural and physicochemical features associated
with activity. Effective descriptor or variable selection is an
integral part of the QSAR modeling process [6]. Obtaining a good
quality QSAR model depends on many factors, such as the
quality of biological data, the choice of descriptors and statistical
Chemometrics and Intelligent Laboratory Systems 90 (2008) 31 – 42
⁎ Corresponding author. Fax: +91 33 2837 1078.
E-mail address: kunalroy_in@yahoo.com (K. Roy).
URL: http://www.geocities.com/kunalroy_in (K. Roy).
0169-7439/$ - see front matter © 2007 Elsevier B.V. All rights reserved.
doi:10.1016/j.chemolab.2007.07.004