Toward a Principled Methodology for Neural Network Design and Performance Evaluation in QSAR. Application to the Prediction of LogP

A. F. Duprat, T. Huynh, and G. Dreyfus*,‡

Laboratoire de Recherches Organiques and Laboratoire d'Électronique, École Supérieure de Physique et de Chimie Industrielles, 10 rue Vauquelin, 75231 Paris Cedex 05, France

Received October 26, 1997

The prediction of properties of molecules from their structure (QSAR) is basically a nonlinear regression problem. Neural networks have been proven to be parsimonious universal approximators of nonlinear functions; therefore, they are excellent candidates for performing the nonlinear regression tasks involved in QSAR. However, their full potential can be exploited only in the framework of a rigorous approach. In the present paper, we describe a principled methodology for designing neural networks for QSAR and estimating their performance, and we apply this approach to the prediction of logP. We compare our results to those obtained on the same molecules by other methods.

1. INTRODUCTION

Neural networks are increasingly used in QSAR, as in many other areas where data modeling is important. Unfortunately, the "biological" inspiration of these statistical tools too often obscures the basic issues involved in neural network design and application, in particular for QSAR applications (see ref 1 for a very valuable, lucid introductory textbook on neural nets). Therefore, the first part of the present paper is devoted to briefly recalling basic principles, some of which are not specific to neural networks, that are frequently overlooked. We insist on the fact that the sole justification for using neural networks in nonlinear regression is their parsimony. In the second part, we briefly summarize the steps to be taken in the design, training, and performance evaluation of a neural network for nonlinear regression. In the third part, we introduce a simple constructive method, based on first principles, for the selection of the variables of a neural model. Finally, we illustrate these principles by the prediction of logP; we compare the results obtained by our approach to those obtained by conventional regression techniques and demonstrate that, as expected from theoretical results, the parsimony of neural networks allows them to make better use of the available data than polynomial regression. We also apply our model selection method and show that it effectively discriminates relevant descriptors from irrelevant ones.

2. ELEMENTS OF A PRINCIPLED APPROACH TO DATA MODELING WITH NEURAL NETWORKS

Because of their biological inspiration, neural networks are usually defined as a set of connected nonlinear elements, as shown in Figure 1. This view, however, is both useless and misleading. Neural networks, as used in QSAR and, more generally, in data modeling applications, have nothing whatsoever to do with the way the brain works; they should be considered as just another family of parameterized nonlinear functions which, like polynomials, wavelets, Fourier series, radial basis functions, splines, etc., are nonlinear approximators (ref 2). Some neural networks do, however, have a specific advantage over other families of parameterized functions, as will be indicated below.
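To make the "family of parameterized functions" viewpoint concrete, here is a minimal sketch (ours, not the authors'; the descriptor values and parameter draws are arbitrary illustrations) that evaluates two such families on the same descriptor vector: a second-degree polynomial and the one-hidden-layer tanh network of Figure 1. Both are ordinary functions of the descriptors and of a set of adjustable parameters; only the functional form differs.

```python
import numpy as np

# Illustrative only: a descriptor vector for one molecule (D = 3).
d = np.array([0.5, -1.2, 0.8])

# Family 1: a second-degree polynomial in the descriptors,
# y = c0 + sum_j c1[j]*d[j] + sum_j c2[j]*d[j]**2.
def polynomial(d, c0, c1, c2):
    return c0 + c1 @ d + c2 @ d**2

# Family 2: the network of Figure 1 -- H "hidden" tanh neurons,
# each with a constant "bias" input, feeding one linear output neuron.
def network(d, theta_hidden, bias_hidden, theta_out, bias_out):
    y_hidden = np.tanh(theta_hidden @ d + bias_hidden)  # hidden outputs y_i
    return theta_out @ y_hidden + bias_out              # linear combination

rng = np.random.default_rng(seed=0)
H, D = 4, d.size
print("polynomial:", polynomial(d, 0.1, rng.normal(size=D), rng.normal(size=D)))
print("network:   ", network(d, rng.normal(size=(H, D)), rng.normal(size=H),
                             rng.normal(size=H), 0.0))
```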
In the framework of statistical data modeling, which is precisely the framework in which neural networks are used for QSAR, these nonlinear functions are intended to approximate the regression function of the predicted property, i.e., the expectation value of that property (viewed as a random variable) conditioned on the set of variables of the model (in QSAR, the descriptors of the molecules). Since the models (polynomials, neural networks, wavelets, radial basis functions, etc.) are parameterized functions, the goal of modeling is the following: estimate the values of the parameters of the model that best predicts the data. The difficulty of the task lies in the fact that a

Figure 1. A multilayer perceptron (a special class of feedforward neural networks), with a layer of H "hidden" neurons and a single, linear output neuron. The output of hidden neuron $i$ is given by $y_i = f\bigl(\sum_{j=1}^{D} \theta_{ij} d_j\bigr)$, where $\{d_j,\ j = 1 \text{ to } D\}$ is the set of descriptors, $\{\theta_{ij},\ j = 1 \text{ to } D\}$ is a set of parameters, and $f(\cdot) = \tanh(\cdot)$. The output of the network is given by $y = \sum_{k=1}^{H} \theta_k y_k$, where $\{\theta_k,\ k = 1 \text{ to } H\}$ is a set of parameters. The output neuron performs a linear combination of the outputs of the hidden neurons, which are themselves nonlinear combinations of the input descriptors; adjustable weights are present in both connection layers, so that the output is nonlinear with respect to the weights of the first layer of connections. Each neuron has an additional, constant input (usually termed "bias"), which is not shown.
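In symbols (this formalization is ours, not quoted from the paper): let $P$ denote the property to be predicted, $\mathbf{d} = (d_1, \ldots, d_D)$ the descriptor vector, and $y(\mathbf{d}, \boldsymbol{\theta})$ the output of the network of Figure 1 with parameter set $\boldsymbol{\theta}$. The regression function to be approximated is the conditional expectation of $P$, and, assuming the usual least-squares criterion, the parameters are estimated from the $N$ training molecules as

$$
g(\mathbf{d}) = E[P \mid \mathbf{d}], \qquad
\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \sum_{n=1}^{N} \bigl[ P^{(n)} - y(\mathbf{d}^{(n)}, \boldsymbol{\theta}) \bigr]^2 .
$$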