Neural Processing Letters 9: 53–61, 1999. © 1999 Kluwer Academic Publishers. Printed in the Netherlands.

Initialization of Supervised Training for Parametric Estimation

P. COSTA and P. LARZABAL
LESIR-ENS de Cachan, 61 av. du Président Wilson, 94235 Cachan Cedex, France. E-mail: pascale.costa@lesir.ens-cachan.fr

Abstract. This paper addresses the initialization of the training algorithm in neural networks, focusing on backpropagation networks with one hidden layer. The initialization of the weights is crucial: if the network is poorly initialized, training converges to a local minimum, so the classical random initialization is a poor solution. By considering a Taylor expansion of the mapping problem and the nonlinearity of the sigmoids, significant improvements can be obtained. We propose a new initialization scheme based on the search for an explicit approximate solution to the problem of mapping patterns to targets. Simulation results show that these initializations avoid local minima, reduce training time, improve generalization, and help estimate the network's size.

Key words: estimation, global convergence, initialization, multilayer perceptron

1. Introduction

The problem of learning in neural networks is naturally formulated as the minimization of an error function. This error is a function of the adaptive parameters (weights and biases) of the network. The minimization of continuous, differentiable functions of many variables has been widely studied, and many of the conventional approaches to this problem apply directly to the training of neural networks. To apply an optimization algorithm to real problems, however, a variety of practical issues must be addressed, as reported in the literature on backpropagation networks.
The principal research directions for multilayer networks have focused on improving the optimization procedure (adaptation of the learning rate, second-order algorithms, modification of the network size during training, criteria for terminating training, normalization of the data, etc.). Less emphasis has been placed on the initialization of the network (see, for example, [1], [2], [3], [4]). Training algorithms usually begin by initializing the weights of the network to randomly chosen values. An appropriate choice of initial weights is therefore potentially important, both in allowing the training algorithm to produce a good set of weights and in improving training speed. Even stochastic algorithms such as gradient descent, which can in principle escape local minima, show strong sensitivity to the initial condition.
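To make the setting concrete, the following is a minimal sketch of the classical random initialization that the paper argues against, for a one-hidden-layer sigmoid network: weights and biases are drawn uniformly from a small interval, and training then minimizes a sum-of-squares error over these parameters. The function names, the uniform interval, and the error definition are illustrative choices, not the paper's method.

```python
import numpy as np

def init_random(n_in, n_hidden, n_out, scale=0.5, rng=None):
    """Classical random initialization: all weights and biases are drawn
    uniformly in [-scale, scale] (the interval is an illustrative choice)."""
    rng = rng or np.random.default_rng(0)
    W1 = rng.uniform(-scale, scale, (n_hidden, n_in))
    b1 = rng.uniform(-scale, scale, n_hidden)
    W2 = rng.uniform(-scale, scale, (n_out, n_hidden))
    b2 = rng.uniform(-scale, scale, n_out)
    return W1, b1, W2, b2

def forward(x, W1, b1, W2, b2):
    """One-hidden-layer perceptron: sigmoid hidden units, linear output."""
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))  # hidden-layer sigmoid activations
    return W2 @ h + b2                         # linear output layer

def sse(params, X, T):
    """Error function minimized during training: sum of squared errors
    between network outputs and targets over the training set."""
    return sum(np.sum((forward(x, *params) - t) ** 2) for x, t in zip(X, T))
```

Because the error surface is non-convex in the weights, different random draws can place the starting point in different basins of attraction, which is the sensitivity to initial conditions described above.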