PLS-Optimal: A Stepwise D-Optimal Design Based on Latent Variables
Stefan Brandmaier,*,†,‡ Ullrika Sahlin,† Igor V. Tetko,‡,§ and Tomas Öberg†
†School of Natural Sciences, Linnaeus University, 391 82 Kalmar, Sweden
‡Helmholtz Zentrum München - German Research Center for Environmental Health (GmbH), Institute of Structural Biology, Ingolstaedter Landstrasse 1, Neuherberg D-85764, Germany
§eADMET GmbH, Ingolstaedter Landstrasse 1, Neuherberg D-85764, Germany
ABSTRACT: Several applications, such as risk assessment
within REACH or drug discovery, require reliable methods for
the design of experiments and efficient testing strategies.
Keeping the number of experiments as low as possible is
important from both a financial and an ethical point of view, as
exhaustive testing of compounds requires significant financial
resources and animal lives. With a large initial set of
compounds, experimental design techniques can be used to
select a representative subset for testing. Once measured, these
compounds can be used to develop quantitative structure-activity relationship models to predict properties of the remaining
compounds. This reduces the required resources and time. D-Optimal design is frequently used to select an optimal set of
compounds by analyzing data variance. We developed a new sequential approach to apply a D-Optimal design to latent variables
derived from a partial least squares (PLS) model instead of principal components. The stepwise procedure selects a new set of
molecules to be measured after each previous measurement cycle. We show that application of the D-Optimal selection generates
models with a significantly improved performance on four different data sets with end points relevant for REACH. Compared to
those derived from principal components, PLS models derived from the selection on latent variables had a lower root-mean-square error and a higher Q² and R². This improvement is statistically significant, especially when only a small number of compounds is selected.
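The stepwise procedure summarized in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `measure` callback, the greedy determinant search (a simplification of the exchange algorithms usually used for D-Optimal design), the small ridge term, and the use of principal component scores for the very first batch (when no measurements exist yet) are all assumptions made for the example.

```python
import numpy as np


def pls1_scores(X, y, n_components):
    """Fit a NIPALS PLS1 model and return a function mapping descriptors to scores."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P = [], []
    for _ in range(n_components):
        w = Xk.T @ yk                     # weight vector: covariance with the response
        w = w / np.linalg.norm(w)
        t = Xk @ w                        # latent-variable scores
        tt = t @ t
        p = Xk.T @ t / tt                 # descriptor loadings
        q = (yk @ t) / tt                 # response loading
        Xk = Xk - np.outer(t, p)          # deflate descriptors
        yk = yk - q * t                   # deflate response
        W.append(w)
        P.append(p)
    W, P = np.array(W).T, np.array(P).T
    R = W @ np.linalg.inv(P.T @ W)        # rotation so that scores = (X - mean) @ R
    return lambda X_new: (X_new - x_mean) @ R


def greedy_d_optimal(T, chosen, n_new, ridge=1e-6):
    """Greedily add n_new rows of T that maximize log det(T_s^T T_s + ridge * I)."""
    chosen = list(chosen)
    for _ in range(n_new):
        base = ridge * np.eye(T.shape[1])
        if chosen:
            base = base + T[chosen].T @ T[chosen]
        best, best_val = None, -np.inf
        for i in range(len(T)):
            if i in chosen:
                continue
            val = np.linalg.slogdet(base + np.outer(T[i], T[i]))[1]
            if val > best_val:
                best, best_val = i, val
        chosen.append(best)
    return chosen


def stepwise_selection(X, measure, batch_size, n_rounds, n_components=2):
    """Select batch_size compounds per round; `measure` is a hypothetical callback
    returning the measured end point for a list of compound indices."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    # Round 1: no responses are available yet, so use principal component scores.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    chosen = greedy_d_optimal(Xc @ Vt[:n_components].T, [], batch_size)
    for _ in range(n_rounds - 1):
        y = measure(chosen)                               # measurement cycle
        to_scores = pls1_scores(Xc[chosen], y, n_components)
        chosen = greedy_d_optimal(to_scores(Xc), chosen, batch_size)
    return chosen
```

In this sketch, each round refits the PLS model on everything measured so far and projects all candidates into the updated latent space before extending the design. An intercept column is omitted from the information matrix for brevity.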
1. INTRODUCTION
The REACH legislation¹ includes the requirement that every chemical compound produced in or imported to the European Union in an amount of more than one ton has to be registered regarding a number of end points. Experimental determination of these properties for all compounds would require high-throughput testing. According to Rovida and Hartung, the financial requirements for such testing are about €9.5 billion.²
For potentially hazardous, dangerous, or hardly degradable
substances, registration also requires information about their
bioaccumulation and toxicity. Apart from concerns about cost and time (a single test, for example for bioconcentration, requires around two months and can cost more than €200), this also leads to ethical problems, as experimental determination of end points associated with toxicity and bioaccumulation relies on animal tests.
Keeping the overhead of (animal) testing as low as possible is also important in many other research areas, for example, in the chemical or pharmaceutical industries. One common strategy to address this problem is to use structure-activity modeling³ and to predict the required properties rather than performing experimental measurements. This strategy entails testing only a small subset of all the compounds of interest and constructing a predictive model using the experimentally determined values. This basic task can be reduced to the problem of drawing a representative subsample of a larger set. This method is important in other fields of research, e.g., quantitative structure-activity relationship (QSAR) development,⁴ large-scale database scanning,⁵ in silico drug design,⁶ and compound prioritization,⁷ as well as in experimental design for risk assessment within REACH.⁸
There are several commonly accepted approaches⁹⁻¹³ for choosing a representative subset of compounds to deliver the most reliable model. These approaches select the subset according to various criteria. Partition-based approaches, like full or factorial design, attempt to select a sample that is representative of the whole chemical space of interest by separating the descriptor space into subspaces and finding a representative compound for each of these subspaces.¹⁴ Other approaches aim to find the subset that is most descriptive for the remaining compounds by ranking the representativeness of compounds according to their pairwise distance in descriptor space.¹⁵,¹⁶
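To make the distance-based idea concrete, the following is a minimal sketch of a Kennard-Stone-style maximin selection, one well-known member of this family. Whether it corresponds to the specific methods cited above is an assumption; the function name and the Euclidean metric are choices made for illustration only.

```python
import math


def maximin_selection(points, n_select):
    """Pick n_select points so that each new point is as far as possible
    from the points already selected (Kennard-Stone-style)."""
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and len(points)")
    # Pairwise Euclidean distances in descriptor space.
    dist = [[math.dist(a, b) for b in points] for a in points]
    # Start with the two most distant compounds.
    i, j = max(
        ((a, b) for a in range(n) for b in range(a + 1, n)),
        key=lambda ab: dist[ab[0]][ab[1]],
    )
    selected = [i, j]
    # Distance from every compound to its nearest selected compound.
    min_dist = [min(dist[k][i], dist[k][j]) for k in range(n)]
    while len(selected) < n_select:
        nxt = max(
            (k for k in range(n) if k not in selected),
            key=lambda k: min_dist[k],
        )
        selected.append(nxt)
        min_dist = [min(min_dist[k], dist[k][nxt]) for k in range(n)]
    return selected
```

For example, with the six two-dimensional points `[(0, 0), (1, 0), (0.1, 0.1), (10, 10), (5, 5), (10, 0)]` and `n_select=4`, the sketch returns the indices `[0, 3, 5, 4]`: the two extreme corners first, then the remaining corner, then the central point, while the two points crowded near the origin are skipped.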
D-Optimal design, which has been recommended as the favorable alternative for linear models in several publications,¹⁷,¹⁸ selects the most representative combination of compounds for linear models.¹⁹ In this method, each possible
Received: January 11, 2012
Published: March 30, 2012
Article
pubs.acs.org/jcim
© 2012 American Chemical Society 975 dx.doi.org/10.1021/ci3000198 | J. Chem. Inf. Model. 2012, 52, 975-983