Toward a Principled Methodology for Neural Network Design and Performance
Evaluation in QSAR. Application to the Prediction of LogP
A. F. Duprat,† T. Huynh,‡ and G. Dreyfus*,‡
Laboratoire de Recherches Organiques and Laboratoire d'Électronique, École Supérieure de Physique et de Chimie Industrielles, 10 rue Vauquelin, 75231 Paris Cedex 05, France
Received October 26, 1997
The prediction of properties of molecules from their structure (QSAR) is basically a nonlinear regression problem. Neural networks have been proven to be parsimonious universal approximators of nonlinear functions; therefore, they are excellent candidates for performing the nonlinear regression tasks involved in QSAR. However, their full potential can be exploited only in the framework of a rigorous approach. In the present paper, we describe a principled methodology for designing neural networks for QSAR and for estimating their performance, and we apply this approach to the prediction of logP. We compare our results to those obtained on the same molecules by other methods.
1. INTRODUCTION
Neural networks are increasingly widely used in QSAR, as well as in various other areas where data modeling is important. Unfortunately, the “biological” inspiration of these statistical tools too often obscures the basic issues involved in neural network design and application, in particular for QSAR applications (see ref 1 for a very valuable, lucid introductory textbook on neural nets). Therefore, the first part of the present paper is devoted to briefly recalling basic principles, some of which are not specific to neural networks, that are frequently overlooked. We emphasize that the sole justification for using neural networks for nonlinear regression is their parsimony. In the second part, we briefly summarize the steps to be taken in the design, training, and performance evaluation of a neural network for nonlinear regression. In the third part, we introduce a simple constructive method, based on first principles, for the selection of the variables of a neural model. Finally, we illustrate these principles by the prediction of logP; we compare the results obtained by our approach to those obtained by conventional regression techniques and demonstrate that, as expected from theoretical results, the parsimony of neural networks allows them to make better use of the available data than polynomial regression; the parameter counts sketched below make this contrast concrete. We also apply our model selection method and show that it allows us to effectively discriminate relevant descriptors from irrelevant ones.
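To make the parsimony contrast concrete, the sketch below (illustrative Python, not taken from this paper; the architecture size H = 5 and polynomial degree n = 3 are assumptions chosen for the example) compares the number of adjustable parameters of a one-hidden-layer network, which grows linearly with the number D of descriptors, with the number of coefficients of a polynomial of total degree n in D variables, which grows combinatorially.

```python
from math import comb

def mlp_params(D, H):
    # One hidden layer of H neurons, each with D weights plus a bias,
    # and a linear output neuron with H weights plus a bias.
    return H * (D + 1) + (H + 1)

def poly_params(D, n):
    # Number of monomials of total degree <= n in D variables.
    return comb(D + n, n)

for D in (5, 10, 20):
    print(f"D={D:2d}  MLP(H=5): {mlp_params(D, 5):4d}  "
          f"poly(n=3): {poly_params(D, 3):5d}")
```

With five hidden neurons the network needs 111 parameters for D = 20 descriptors, whereas the cubic polynomial already requires 1771 coefficients.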
2. ELEMENTS OF A PRINCIPLED APPROACH TO
DATA MODELING WITH NEURAL NETWORKS
Because of their biological inspiration, neural networks are usually defined as a set of connected nonlinear elements, as shown in Figure 1. This view, however, is both useless and misleading. Neural networks, as used in QSAR and, more generally, in data modeling applications, have nothing whatsoever to do with the way the brain works; they should be considered as just another family of parameterized nonlinear functions which, like polynomials, wavelets, Fourier series, radial basis functions, splines, etc., are nonlinear approximators (see ref 2); some neural networks do have, however, a specific advantage over other families of parameterized functions, as will be indicated below. In the framework of statistical data modeling, which is precisely that in which neural networks are used for QSAR, these nonlinear functions are intended to approximate the regression function of the predicted property, i.e., the expectation value of the latter (viewed as a random variable) conditional on the set of variables of the model (the descriptors of the molecules in QSAR). Since the models (polynomials, neural networks, wavelets, radial basis functions, etc.) are parameterized functions, the goal of modeling is to estimate the values of the parameters of the model that best predict the data, as formalized below. The difficulty of the task lies in the fact that a
† Laboratoire de Recherches Organiques.
‡ Laboratoire d'Électronique.
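In symbols, with notation consistent with Figure 1: if $g(\mathbf{d}, \boldsymbol{\theta})$ denotes the parameterized model, $\mathbf{d}$ the vector of descriptors, and $y_p$ the property to be predicted (viewed as a random variable), the goal stated above can be written as follows (the least-squares criterion is the standard estimator for this task; it is shown here as an illustration, not quoted from this passage):

$$
g(\mathbf{d}, \boldsymbol{\theta}) \approx E[y_p \mid \mathbf{d}], \qquad
\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \sum_{n=1}^{N} \bigl[\, y_{p,n} - g(\mathbf{d}_n, \boldsymbol{\theta}) \,\bigr]^2,
$$

where the sum runs over the $N$ examples (molecules) of the training set.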
Figure 1. A multilayer perceptron (a special class of feedforward neural networks), with a layer of H “hidden” neurons and a single, linear output neuron. The output of hidden neuron i is given by $y_i = f\!\left[\sum_{j=1}^{D} \theta_{ij} d_j\right]$, where $\{d_j,\ j = 1 \text{ to } D\}$ is the set of descriptors, $\{\theta_{ij},\ j = 1 \text{ to } D\}$ is a set of parameters, and where $f(\cdot) = \tanh(\cdot)$. The output of the network is given by $y = \sum_{k=1}^{H} \theta_k y_k$, where $\{\theta_k,\ k = 1 \text{ to } H\}$ is a set of parameters. The output neuron performs a linear combination of the outputs of the hidden neurons, which are nonlinear combinations of the input descriptors; adjustable weights are present in both connection layers, so that the output is nonlinear with respect to the weights of the first layer of connections. Each neuron has an additional, constant input (usually termed “bias”) which is not shown.
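As a concrete reading of the caption, here is a minimal NumPy sketch of the network of Figure 1 (illustrative code, not the authors' implementation; all names and the random parameter values are assumptions). The constant “bias” inputs mentioned at the end of the caption are included explicitly.

```python
import numpy as np

def mlp_output(d, theta_hidden, bias_hidden, theta_out, bias_out):
    """Figure 1 network: D descriptors -> H tanh hidden neurons -> linear output.

    theta_hidden: (H, D) weights of the first connection layer.
    bias_hidden:  (H,)   constant "bias" inputs of the hidden neurons.
    theta_out:    (H,)   weights of the linear output neuron.
    bias_out:     scalar bias of the output neuron.
    """
    # Hidden neuron i computes y_i = tanh(sum_j theta_ij * d_j + bias_i).
    y_hidden = np.tanh(theta_hidden @ d + bias_hidden)
    # The output neuron is a linear combination of the hidden outputs,
    # so the output is nonlinear only in the first-layer weights.
    return float(theta_out @ y_hidden + bias_out)

# Illustrative use with D = 4 descriptors and H = 3 hidden neurons.
rng = np.random.default_rng(0)
D, H = 4, 3
d = rng.normal(size=D)  # descriptor vector of one molecule
y = mlp_output(d, rng.normal(size=(H, D)), rng.normal(size=H),
               rng.normal(size=H), rng.normal())
print(y)
```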