Machine Learning Methods for Property Prediction in Chemoinformatics: Quo Vadis?

Alexandre Varnek* and Igor Baskin

Laboratoire d'Infochimie, UMR 7177 CNRS, Université de Strasbourg, 4, rue B. Pascal, Strasbourg 67000, France
Department of Chemistry, Moscow State University, Moscow 119991, Russia

ABSTRACT: This paper is focused on modern approaches to machine learning, most of which are as yet used infrequently or not at all in chemoinformatics. Machine learning methods are characterized in terms of the "modes of statistical inference" and "modeling levels" nomenclature and by considering different facets of the modeling with respect to input/output matching, data types, models duality, and models inference. Particular attention is paid to new approaches and concepts that may provide efficient solutions to common problems in chemoinformatics: improvement of the predictive performance of structure−property (activity) models, generation of structures possessing desirable properties, the model applicability domain, modeling of properties with functional endpoints (e.g., phase diagrams and dose−response curves), and accounting for multiple molecular species (e.g., conformers or tautomers).

1. INTRODUCTION

Over the last 30 years, the area of machine learning (statistical learning or data mining) has undergone significant changes comparable with the revolution in physics at the beginning of the 20th century. The main problem in classical mathematical statistics concerns its inability to answer the "simple" question: Why does a model that perfectly fits the training data sometimes lead to incorrect predictions for an independent test set? Classical statistics in fact guarantees correct predictions only asymptotically, i.e., for infinitely large training sets. Fisher's parametric statistics requires the identification in advance of both the relationships between the input and output data and the probability distributions of the data.
It specifies a few free parameters of those relationships and distributions to be found in the statistical study. More recent nonparametric statistics does not require exact model specification, but it is restricted to data of low dimensionality because of the "curse of dimensionality".1 These limitations are too restrictive to allow the solution of most real-world problems. Nowadays, the fundamental paradigm of statistical analysis has changed from "system identification" (in which the aim is to reconstruct the true probability distributions as the necessary step to achieve good predictive performance) to "predictive modeling" (in which simple, although not necessarily correct, probability distributions and/or decision functions are used to build models with the highest predictive performance in the area occupied by the actual data).2 The new paradigm, first employed with artificial neural networks,3,4 received theoretical backing through the development of new statistical theories capable of dealing with small data sets and oriented toward prediction: the statistical learning theory of Vapnik,5,6 the PAC (Probably Approximately Correct) theory of Valiant,7 the minimum description length concept of Rissanen,8 and some others.

Chemoinformatics, an area at the interface of chemistry and informatics,9−14 is constantly exposed to the evolution in statistics and machine learning. The penetration of new data mining approaches into chemoinformatics has sometimes been the result of short-lived enthusiasm for novel methods, as with neural networks and support vector machines. A reflection in chemoinformatics of the last crisis in statistics was the appearance of publications expressing disappointment in the capacity of QSAR/QSPR and similarity search methods to provide reliable predictions.15
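The shift from "system identification" to "predictive modeling" can be illustrated with a toy sketch (not from this paper; all function names and data are hypothetical). Rather than estimating the data-generating distribution, a simple decision function is fit directly by gradient descent and judged solely by its predictive performance on held-out data.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit a linear decision function (logistic regression) by plain
    gradient descent; no attempt is made to model the distribution of X."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # predicted class probabilities
        w -= lr * X.T @ (p - y) / len(y)   # gradient of the logistic loss
    return w

def accuracy(w, X, y):
    """Held-out predictive performance: fraction of correct predictions."""
    return float(((X @ w > 0).astype(int) == y).mean())

# Synthetic example: labels follow a simple linear rule.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_tr, y_tr, X_te, y_te = X[:150], y[:150], X[150:], y[150:]

w = fit_logistic(X_tr, y_tr)
print(accuracy(w, X_te, y_te))  # model quality is judged only by this number
```

The point of the sketch is methodological: the model is deliberately "simple, although not necessarily correct", and its sole figure of merit is predictive accuracy on data it has not seen.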
This is not unexpected given that, instead of treating congeneric data sets, one should be able to base models on very small (issuing from costly experiments) or very large (issuing from screening campaigns) structurally diverse data sets. The models developed on training sets of limited size should be applicable in virtual screening or for the annotation of large databases. Thus, a subset of compounds should be identified to which the model can be applied with good predictive performance, i.e., by defining the model's applicability domain (AD). Despite the large number of publications devoted to the AD, this problem is still far from being resolved. The development of predictive tools for drug design is a major stimulus for the generation of experimental data, specifically for model development. The question is how to construct the "optimal" training set (size, composition) to build predictive models.

In fact, predictive performance of the models is not the only problem to solve (Figure 1); there are others where the absence of appropriate machine learning methods represents a real bottleneck. This concerns the modeling of properties with functional endpoints (e.g., phase diagrams and dose−response curves), accounting for multiple molecular species (e.g., conformers or tautomers16), and the direct generation of chemical structures (inverse QSAR17−26). There is also a fundamental problem of descriptors derived from molecular structures. There is in general a loss of information resulting from the representation of a molecular structure by a fixed number of descriptors. Therefore, the

Received: September 1, 2011. Published: May 14, 2012. Perspective. pubs.acs.org/jcim. © 2012 American Chemical Society. dx.doi.org/10.1021/ci200409x | J. Chem. Inf. Model. 2012, 52, 1413−1437