Benchmarking of Linear and Nonlinear Approaches for Quantitative Structure-Property Relationship Studies of Metal Complexation with Ionophores

Igor V. Tetko†
Institute of Bioorganic & Petrochemistry, Kiev, Ukraine

Vitaly P. Solov’ev
Institute of Physical Chemistry, Russian Academy of Sciences, Leninskiy prospect 31a, 119991 Moscow, Russia

Alexey V. Antonov
Institute for Bioinformatics, Neuherberg D-85764, Germany

Xiaojun Yao, Jean Pierre Doucet, and Botao Fan
Université Paris 7-Denis Diderot, ITODYS-CNRS UMR 7086, 1, rue Guy de la Brosse, Paris 75005, France

Frank Hoonakker, Denis Fourches, Piere Jost, Nicolas Lachiche, and Alexandre Varnek*
Laboratoire d’Infochimie, UMR 7551 CNRS, Université Louis Pasteur, 4, rue B. Pascal, Strasbourg 67000, France

Received September 24, 2005

A benchmark of several popular methods, Associative Neural Networks (ANN), Support Vector Machines (SVM), k Nearest Neighbors (kNN), Maximal Margin Linear Programming (MMLP), Radial Basis Function Neural Network (RBFNN), and Multiple Linear Regression (MLR), is reported for quantitative structure-property relationships (QSPR) of stability constants log K1 for the 1:1 (M:L) and log β2 for the 1:2 complexes of metal cations Ag+ and Eu3+ with diverse sets of organic molecules in water at 298 K and ionic strength 0.1 M. The methods were tested on three types of descriptors: molecular descriptors including E-state values, counts of atoms determined for E-state atom types, and substructural molecular fragments (SMF). Comparison of the models was performed using a 5-fold external cross-validation procedure. Robust statistical tests (bootstrap and Kolmogorov-Smirnov statistics) were employed to evaluate the significance of calculated models. The Wilcoxon signed-rank test was used to compare the performance of methods.
Individual structure-complexation property models obtained with nonlinear methods demonstrated a significantly better performance than the models built using multilinear regression analysis (MLRA). However, averaging several MLRA models based on SMF descriptors provided predictions as good as those of the most efficient nonlinear techniques. Support Vector Machines and Associative Neural Networks contributed to the largest number of significant models. Models based on fragments (SMF descriptors and E-state counts) had higher prediction ability than those based on E-state indices. The use of SMF descriptors and E-state counts provided similar results, whereas E-state indices led to less significant models. The current study illustrates the difficulties of quantitative comparison of different methods: conclusions based only on one data set without appropriate statistical tests could be wrong.

INTRODUCTION

An important branch of supramolecular chemistry is the chemistry of ionophores, molecules possessing high affinity toward metal cations in solution. Their ability to bind cations selectively is widely used in practice for the separation and concentration of metals (solvent extraction) and in analytical devices (ion-selective electrodes, CHEMFETs, etc.).1 Experimental measurements of stability constants of ionophore-metal complexes and related free energies of complexation reactions are rather difficult and costly tasks. That is why a theoretical quantitative estimation of complex stabilities might become an important complement to experimental studies, giving researchers a way to reduce the number of experiments and indicating a strategy for the “optimization” of known metal binders. The thermodynamic complexation properties depend on many parameters: the nature of the metal, the structure of the ionophore, solvent, counterion(s), temperature, and background compounds.
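For orientation, the stability constants and free energies of complexation referred to above can be written out explicitly. The block below is a sketch using standard textbook definitions rather than notation taken from this paper: M denotes the metal cation, L the ligand (ionophore), charges are omitted, and the constants are assumed to be concentration-based at fixed ionic strength.

```latex
% Assumed notation: M = metal cation, L = ligand (ionophore); charges omitted.
\begin{align}
\mathrm{M} + \mathrm{L} &\rightleftharpoons \mathrm{ML}, &
K_1 &= \frac{[\mathrm{ML}]}{[\mathrm{M}][\mathrm{L}]} \\
\mathrm{M} + 2\,\mathrm{L} &\rightleftharpoons \mathrm{ML}_2, &
\beta_2 &= \frac{[\mathrm{ML}_2]}{[\mathrm{M}][\mathrm{L}]^{2}} \\
\Delta G^{\circ} &= -RT \ln K = -2.303\,RT \log K
\end{align}
```

At 298 K the factor 2.303 RT is about 5.7 kJ/mol, so an uncertainty of one log unit in a stability constant corresponds to roughly 5.7 kJ/mol in the free energy of complexation.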
In experiments, even small inaccuracies in measuring species concentration or temperature may lead to errors in complexation constants up to several log units.2,3

* Corresponding author e-mail: varnek@chimie.u-strasbg.fr.
† Current address: Institute for Bioinformatics, Neuherberg D-85764, Germany. http://www.vcclab.org.

J. Chem. Inf. Model. 2006, 46, 808-819. 10.1021/ci0504216 CCC: $33.50 © 2006 American Chemical Society. Published on Web 01/17/2006

One can mention different theoretical approaches to assess free energies of complexation. Quantum mechanics calculations in the gas phase could hardly be recommended for these