PLS-Optimal: A Stepwise D-Optimal Design Based on Latent Variables

Stefan Brandmaier,* Ullrika Sahlin, Igor V. Tetko,§ and Tomas Öberg

School of Natural Sciences, Linnaeus University, 391 82 Kalmar, Sweden
Helmholtz Zentrum München - German Research Center for Environmental Health (GmbH), Institute of Structural Biology, Ingolstaedter Landstrasse 1, Neuherberg D-85764, Germany
§ eADMET GmbH, Ingolstaedter Landstrasse 1, Neuherberg D-85764, Germany

ABSTRACT: Several applications, such as risk assessment within REACH or drug discovery, require reliable methods for the design of experiments and efficient testing strategies. Keeping the number of experiments as low as possible is important from both a financial and an ethical point of view, as exhaustive testing of compounds requires significant financial resources and animal lives. With a large initial set of compounds, experimental design techniques can be used to select a representative subset for testing. Once measured, these compounds can be used to develop quantitative structure-activity relationship models to predict properties of the remaining compounds. This reduces the required resources and time. D-Optimal design is frequently used to select an optimal set of compounds by analyzing data variance. We developed a new sequential approach that applies a D-Optimal design to latent variables derived from a partial least squares (PLS) model instead of principal components. The stepwise procedure selects a new set of molecules to be measured after each previous measurement cycle. We show that application of the D-Optimal selection generates models with a significantly improved performance on four different data sets with end points relevant for REACH. Compared to those derived from principal components, PLS models derived from the selection on latent variables had a lower root-mean-square error and a higher Q² and R². This improvement is statistically significant, especially when only a small number of compounds is selected.

1. INTRODUCTION

The REACH legislation¹ requires that every chemical compound produced in or imported to the European Union in an amount of more than one ton be registered with respect to a number of end points. Experimental determination of these properties for all compounds would require high-throughput testing. According to Rovida and Hartung, the financial requirements for such testing are about €9.5 billion.² For potentially hazardous, dangerous, or hardly degradable substances, registration also requires information about their bioaccumulation and toxicity. Apart from cost and time efficiency (a single measurement, for example, of bioconcentration, requires around two months and can cost more than 200 [...]), this also leads to ethical problems, as end points associated with toxicity and bioaccumulation are determined experimentally by animal tests. The need to keep the overhead of (animal) testing as low as possible is also important in many other areas, for example, the chemical or pharmaceutical industries.

One common strategy to address this problem is to use structure-activity modeling³ and to predict the required properties rather than performing experimental measurements. This strategy entails testing only a small subset of all the compounds of interest and constructing a predictive model using the experimentally determined values. This basic task can be reduced to the problem of drawing a representative subsample of a larger set. This method is important in other fields of research, e.g., quantitative structure-activity relationship (QSAR) development,⁴ large-scale database scanning,⁵ in silico drug design,⁶ and compound prioritization,⁷ as well as in experimental design for risk assessment within REACH.⁸

There are several commonly accepted approaches⁹⁻¹³ for choosing a representative subset of compounds to deliver the most reliable model. These approaches select the subset according to various criteria.
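The task of drawing a representative subsample can be illustrated with the D-Optimal criterion named in the abstract, which favors subsets whose information matrix XᵀX has a large determinant. The sketch below is not the procedure developed in this work. It is a simple greedy forward selection, chosen here for brevity (exchange-type algorithms are often used instead), and the names `greedy_d_optimal`, `ridge`, and `X_demo` are hypothetical. The rows of `X` stand for whatever variables describe the compounds, for example principal component or PLS scores.

```python
import numpy as np


def greedy_d_optimal(X, n_select, ridge=1e-6):
    """Greedily pick rows of X that increase det(X_s^T X_s + ridge * I) the most.

    Illustrative sketch only: `ridge` is an assumed small regularizer that keeps
    the information matrix invertible while fewer rows than columns are selected.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    available = np.ones(n, dtype=bool)
    selected = []
    M = ridge * np.eye(p)  # regularized information matrix of the current subset
    for _ in range(n_select):
        M_inv = np.linalg.inv(M)
        # Matrix determinant lemma: det(M + x x^T) = det(M) * (1 + x^T M^-1 x),
        # so the candidate with the largest x^T M^-1 x gives the largest determinant.
        gain = np.einsum("ij,jk,ik->i", X, M_inv, X)
        gain[~available] = -np.inf  # never pick the same row twice
        best = int(np.argmax(gain))
        selected.append(best)
        available[best] = False
        M += np.outer(X[best], X[best])
    return selected


# Toy candidate set: 11 points on a line, modeled with intercept and slope.
t = np.linspace(-1.0, 1.0, 11)
X_demo = np.column_stack([np.ones_like(t), t])
print(sorted(greedy_d_optimal(X_demo, 2)))  # -> [0, 10]
```

For this toy candidate set the sketch picks the two end points of the interval, which is the pair that maximizes the determinant for a straight-line model. In a stepwise setting such as the one outlined in the abstract, the selection would be repeated after each measurement cycle on updated variables; that loop is not shown here.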
Partition-based approaches, like full or factorial design, attempt to select a sample that is representative of the whole chemical space of interest, separating the descriptor space into subspaces and finding a representative compound for each of these subspaces.¹⁴ Other approaches aim to find the subset that is most descriptive for the remaining compounds by ranking the representativity of compounds according to their pairwise distance in descriptor space.¹⁵,¹⁶ D-Optimal design, which has been recommended as the favorable alternative for linear models in several publications,¹⁷,¹⁸ selects the most representative combination of compounds for linear models.¹⁹

Received: January 11, 2012. Published: March 30, 2012. J. Chem. Inf. Model. 2012, 52, 975-983. dx.doi.org/10.1021/ci3000198. pubs.acs.org/jcim. © 2012 American Chemical Society.

In this method, each possible