Design of experiments applied to QSAR: ranking a set of compounds and establishing a statistical significance test

J.M. Barroso, E. Besalú *

Institute of Computational Chemistry, University of Girona, Avda. Montilvi sn, E-17071 Girona, Spain

Received 18 November 2004; accepted 20 January 2005
Available online 29 June 2005

Abstract

A two-level fractional factorial design is applied to a set of peptides, yielding a non-linear model that accounts for the interactions. The model is used to establish a molecular ranking over a test set. The prediction results are quantified by a probabilistic statistical test of significance.

© 2005 Elsevier B.V. All rights reserved.

Keywords: Design of experiments; QSAR; Molecular ranking; Statistical significance test; Peptide design

1. Introduction

Design of experiments (DOE) [1] is a well-known statistical methodology that extracts relevant information from experimental data in order to redirect the course of an investigation. Originally, DOE was intended to be applied in industrial or engineering fields, but its range of action spread rapidly to other scientific areas. The QSPR and QSAR fields were no exception, and some authors have considered DOE a useful tool for optimizing molecular properties [2–4]. In this context, the technique has proven very useful and, sometimes, necessary in order to capture simultaneous, synergistic and non-linear effects. For instance, reference [5] contains a comment on peptide chemistry that can be extended to other QSAR areas: ‘The intuitive way to select a set of peptide analogues is to change one amino acid position at a time. This ‘design’, or rather lack of design, is inefficient. This is because the resulting data will not contain any information about the joint influence of the substituted positions on the peptide activity.
This inefficiency of ‘one feature at a time’ designs is well-known in chemical engineering and statistics but seems to be unrecognized in peptide chemistry’.

Journal of Molecular Structure: THEOCHEM 727 (2005) 89–96
www.elsevier.com/locate/theochem
0166-1280/$ - see front matter © 2005 Elsevier B.V. All rights reserved.
doi:10.1016/j.theochem.2005.02.051
* Corresponding author. Tel.: +34 972 418875; fax: +34 972 418150. E-mail address: emili@iqc.udg.es (E. Besalú).

In particular, many of the basic DOE approaches developed in QSAR deal with the concept of principal properties [2–6]. A principal property is encoded by a numerical descriptor attached to a single or latent variable of the problem. It is assumed that principal properties collect the most relevant information (generally structural) about the molecules or objects under study. Normally, a partial least squares (PLS) [7] latent variable or a principal component analysis (PCA) [8] vector is identified as a principal property. This particular encoding makes it possible to sort the described objects (molecules, substituents, solvents, ...) according to their principal property values. This, in turn, has allowed the application of full or fractional factorial designs (FD) with two or more levels per variable. The numerical encoding has a clear advantage: it automatically sorts the elements involved in the calculations. On the other hand, this ranking depends on the particular descriptors that were considered in defining the principal properties. In other words, it has to be recognized that the numerical encoding is itself a modeling choice that can precondition the results obtained by the DOE treatment protocol. In this way, the whole chemometric process may