Machine Learning Methods for Property Prediction in
Chemoinformatics: Quo Vadis?
Alexandre Varnek*,† and Igor Baskin†,‡
†Laboratoire d’Infochimie, UMR 7177 CNRS, Université de Strasbourg, 4, rue B. Pascal, Strasbourg 67000, France
‡Department of Chemistry, Moscow State University, Moscow 119991, Russia
ABSTRACT: This paper is focused on modern approaches to machine learning, most of which are as yet used infrequently or
not at all in chemoinformatics. Machine learning methods are characterized in terms of the “modes of statistical inference” and
“modeling levels” nomenclature and by considering different facets of modeling with respect to input/output matching, data
types, model duality, and model inference. Particular attention is paid to new approaches and concepts that may provide
efficient solutions of common problems in chemoinformatics: improvement of predictive performance of structure−property
(activity) models, generation of structures possessing desirable properties, model applicability domain, modeling of properties
with functional endpoints (e.g., phase diagrams and dose−response curves), and accounting for multiple molecular species (e.g.,
conformers or tautomers).
1. INTRODUCTION
Over the last 30 years, the area of machine learning (statistical
learning or data mining) has undergone significant changes
comparable with the revolution in physics at the beginning of
the 20th century. The main problem in classical mathematical
statistics is its inability to answer a “simple” question:
why does a model that perfectly fits the training data sometimes
lead to incorrect predictions on an independent test set?
Classical statistics in fact guarantees correct predictions only
asymptotically, i.e., for infinitely large training sets. Fisher’s
parametric statistics requires the identification in advance of
both the relationships between the input and output data and the
probability distributions of the data; only a few free
parameters of those relationships and distributions remain to be
found in the statistical study. More recent nonparametric
statistics does not require exact model specification, but it is
restricted to data of low dimensionality because of the “curse of
dimensionality”.1 These limitations are too restrictive to allow
the solution of most real-world problems. Nowadays, the
fundamental paradigm of statistical analysis has changed from
“system identification” (in which the aim is to reconstruct the true
probability distributions as a necessary step toward good
predictive performance) to “predictive modeling” (in which
simple, although not necessarily correct, probability distributions
and/or decision functions are used to build models with
the highest predictive performance in the region occupied by the
actual data).2 The new paradigm, first employed with artificial
neural networks,3,4 received theoretical backing through the
development of new statistical theories capable of dealing with
small data sets and oriented toward prediction: the statistical
learning theory of Vapnik,5,6 the PAC (Probably Approximately
Correct) theory of Valiant,7 the minimum description length
concept of Rissanen,8 and some others.
Chemoinformatics, an area at the interface of chemistry and
informatics,9−14 is constantly exposed to the evolution of
statistics and machine learning. The penetration of new data
mining approaches into chemoinformatics has sometimes been
the result of short-lived enthusiasm for novel methods, as with
neural networks and support vector machines. One reflection in
chemoinformatics of the latest crisis in statistics was the
appearance of publications expressing disappointment in the
capacity of QSAR/QSPR and similarity search methods to
provide reliable predictions.15
This is not unexpected given that,
instead of treating congeneric data sets, one should be able to
base models on very small (arising from costly experiments) or
very large (arising from screening campaigns) structurally
diverse data sets. Models developed on limited-size
training sets should be applicable in virtual screening or for the
annotation of large databases. Thus, a subset of compounds
should be identified to which a model can be applied with
good predictive performance, i.e., by defining the model’s
applicability domain (AD). Despite the large number of
publications devoted to the AD, this problem is still far from
being resolved.
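To make the notion of an AD concrete, the following is a minimal sketch of one common distance-based heuristic (a hypothetical illustration, not a specific method advocated in this paper): a query compound is considered inside the AD if its mean distance to its nearest training-set neighbors in descriptor space does not exceed a threshold derived from the training set itself. All names and parameter values here are illustrative assumptions.

```python
import numpy as np

def in_applicability_domain(x, X_train, k=3, z=2.0):
    """Distance-based applicability-domain check (one common heuristic).

    A query descriptor vector x is inside the AD if its mean Euclidean
    distance to its k nearest training compounds does not exceed
    mean + z * std of the same quantity computed over the training set.
    """
    def knn_mean_dist(q, X, k, exclude_self=False):
        d = np.sort(np.linalg.norm(X - q, axis=1))
        if exclude_self:
            d = d[1:]  # drop the zero self-distance
        return d[:k].mean()

    # Reference distribution: each training compound vs. the rest
    ref = np.array([knn_mean_dist(t, X_train, k, exclude_self=True)
                    for t in X_train])
    threshold = ref.mean() + z * ref.std()
    return bool(knn_mean_dist(np.asarray(x), X_train, k) <= threshold)
```

The choices k = 3 and z = 2.0 are arbitrary here; in practice such parameters would be tuned, e.g., by cross-validation against prediction-error statistics.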
The development of predictive tools for drug design is a
major stimulus for the generation of experimental data,
specifically for model development. The question is how to
construct the “optimal” training set (size, composition) to build
predictive models.
In fact, the predictive performance of the models is not the only
problem to solve (Figure 1); there are others where the absence
of appropriate machine learning methods represents a real
bottleneck. This concerns the modeling of properties with
functional endpoints (e.g., phase diagrams and dose−response
curves), accounting for multiple molecular species (e.g.,
conformers or tautomers16), and the direct generation of chemical
structures (“inverse QSAR”17−26).
There is also a fundamental problem with descriptors derived
from molecular structures: representing a molecular structure by
a fixed number of descriptors generally entails a loss of
information. Therefore, the
Received: September 1, 2011
Published: May 14, 2012
Perspective
pubs.acs.org/jcim
© 2012 American Chemical Society 1413 dx.doi.org/10.1021/ci200409x | J. Chem. Inf. Model. 2012, 52, 1413−1437