Machine Learning Methods for Property Prediction in
Chemoinformatics: Quo Vadis?
Alexandre Varnek*,† and Igor Baskin†,‡
†Laboratoire d’Infochimie, UMR 7177 CNRS, Université de Strasbourg, 4, rue B. Pascal, Strasbourg 67000, France
‡Department of Chemistry, Moscow State University, Moscow 119991, Russia
ABSTRACT: This paper is focused on modern approaches to machine learning, most of which are as yet used infrequently or
not at all in chemoinformatics. Machine learning methods are characterized in terms of the “modes of statistical inference” and
“modeling levels” nomenclature and by considering different facets of modeling with respect to input/output matching, data
types, model duality, and model inference. Particular attention is paid to new approaches and concepts that may provide
efficient solutions of common problems in chemoinformatics: improvement of predictive performance of structure−property
(activity) models, generation of structures possessing desirable properties, model applicability domain, modeling of properties
with functional endpoints (e.g., phase diagrams and dose−response curves), and accounting for multiple molecular species (e.g.,
conformers or tautomers).
1. INTRODUCTION
Over the last 30 years, the area of machine learning (statistical
learning or data mining) has undergone significant changes
comparable with the revolution in physics at the beginning of
the 20th century. The main problem in classical mathematical
statistics is its inability to answer a “simple” question:
why does a model that perfectly fits the training data sometimes
lead to incorrect predictions on an independent test set?
Classical statistics in fact guarantees correct predictions only
asymptotically, i.e., for infinitely large training sets. Fisher’s
parametric statistics requires the identification in advance of
both the relationships between the input and output data and the
probability distributions of the data; only a few free
parameters of those relationships and distributions remain to be
found in the statistical study. More recent nonparametric
statistics does not require exact model specification, but it is
restricted to data of low dimensionality because of the “curse of
dimensionality”.1 These limitations are too restrictive to allow
the solution of most real-world problems. Nowadays, the
fundamental paradigm of statistical analysis has changed from
“system identification” (in which the aim is to reconstruct the true
probability distributions as a necessary step toward good
predictive performance) to “predictive modeling” (in which
simple, although not necessarily correct, probability distributions
and/or decision functions are used to build models with
the highest predictive performance in the region occupied by the
actual data).2 The new paradigm, first employed with artificial
neural networks,3,4 received theoretical backing through the
development of new statistical theories capable of dealing with
small data sets and oriented toward prediction: the statistical
learning theory of Vapnik,5,6 the PAC (Probably Approximately
Correct) theory of Valiant,7 the minimum description length
concept of Rissanen,8 and some others.
Chemoinformatics, an area at the interface of chemistry and
informatics,9−14 is constantly exposed to the evolution of
statistics and machine learning. The penetration of new data
mining approaches into chemoinformatics has sometimes been
the result of short-lived enthusiasm for novel methods, as with
neural networks and support vector machines. One reflection in
chemoinformatics of the latest crisis in statistics was the
appearance of publications expressing disappointment in the
capacity of QSAR/QSPR and similarity search methods to
provide reliable predictions.15
This is not unexpected given that,
instead of treating congeneric data sets, one should be able to
base models on very small (arising from costly experiments) or
very large (arising from screening campaigns) structurally
diverse data sets. Models developed on limited-size
training sets should be applicable in virtual screening or for the
annotation of large databases. Thus, a subset of compounds
should be identified to which a model can be applied with
good predictive performance, i.e., by defining the model’s
applicability domain (AD). Despite the large number of
publications devoted to the AD, this problem is still far from
being resolved.
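To make the notion of an AD concrete, the following is a minimal sketch of one common distance-based heuristic (a hypothetical illustration, not a specific method advocated in this paper): a query compound is considered inside the AD if its mean distance to its nearest training-set neighbors in descriptor space does not exceed a threshold derived from the training set itself. All names and parameter values here are illustrative assumptions.

```python
import numpy as np

def in_applicability_domain(x, X_train, k=3, z=2.0):
    """Distance-based applicability-domain check (one common heuristic).

    A query descriptor vector x is inside the AD if its mean Euclidean
    distance to its k nearest training compounds does not exceed
    mean + z * std of the same quantity computed over the training set.
    """
    def knn_mean_dist(q, X, k, exclude_self=False):
        d = np.sort(np.linalg.norm(X - q, axis=1))
        if exclude_self:
            d = d[1:]  # drop the zero self-distance
        return d[:k].mean()

    # Reference distribution: each training compound vs. the rest
    ref = np.array([knn_mean_dist(t, X_train, k, exclude_self=True)
                    for t in X_train])
    threshold = ref.mean() + z * ref.std()
    return bool(knn_mean_dist(np.asarray(x), X_train, k) <= threshold)
```

The choices k = 3 and z = 2.0 are arbitrary here; in practice such parameters would be tuned, e.g., by cross-validation against prediction-error statistics.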
The development of predictive tools for drug design is a
major stimulus for the generation of experimental data,
specifically for model development. The question is how to
construct the “optimal” training set (size, composition) to build
predictive models.
In fact, the predictive performance of the models is not the only
problem to solve (Figure 1); there are others where the absence
of appropriate machine learning methods represents a real
bottleneck. This concerns the modeling of properties with
functional endpoints (e.g., phase diagrams and dose−response
curves), accounting for multiple molecular species (e.g.,
conformers or tautomers16), and the direct generation of chemical
structures (“inverse QSAR”17−26).
There is also a fundamental problem with descriptors derived
from molecular structures: representing a molecular structure by
a fixed number of descriptors generally entails a loss of
information. Therefore, the
Received: September 1, 2011
Published: May 14, 2012
Perspective
pubs.acs.org/jcim
© 2012 American Chemical Society 1413 dx.doi.org/10.1021/ci200409x | J. Chem. Inf. Model. 2012, 52, 1413−1437