IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 4, JULY 1999

Brief Papers

Training Multilayer Perceptron Classifiers Based on a Modified Support Vector Method

J. A. K. Suykens and J. Vandewalle

Abstract— In this paper we describe a training method for a one-hidden-layer multilayer perceptron classifier based on the idea of support vector machines (SVM's). An upper bound on the Vapnik–Chervonenkis (VC) dimension is iteratively minimized over the interconnection matrix of the hidden layer and its bias vector. The output weights are determined according to the support vector method, but without making use of the classifier form that is related to Mercer's condition. The method is illustrated on a two-spiral classification problem.

Index Terms— Classification, multilayer perceptrons, support vector machines.

I. INTRODUCTION

IT IS well known that multilayer perceptrons (MLP's) are universal in the sense that they can approximate any continuous nonlinear function arbitrarily well on a compact interval. As a result, MLP's became popular for parametrizing nonlinear models and classifiers, often leading to improved results compared to classical methods [1], [2], [5], [10], [16]. One of the major drawbacks is that batch training of MLP's usually requires solving a nonlinear optimization problem with many local minima.

Recently, support vector machines (SVM's) have been introduced, for which classification and function estimation problems are formulated as quadratic programming (QP) problems [12]–[15]. The idea of the SVM originates from finding an optimal hyperplane that separates two classes with maximal margin. It has later been extended to one-hidden-layer multilayer perceptrons, radial basis function networks, and other architectures.
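The maximal-margin idea can be made concrete with a small sketch. The following trains a linear soft-margin classifier by stochastic subgradient descent on the hinge loss (a Pegasos-style stand-in for the QP formulation discussed in the paper, in the homogeneous no-bias form; all names and parameter values are illustrative):

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Minimize lam/2 * ||w||^2 + mean(max(0, 1 - y_i * (w . x_i)))
    by stochastic subgradient descent (homogeneous form: append a
    constant feature to X if an offset term is needed)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, epochs * n + 1):
        i = rng.integers(n)
        eta = 1.0 / (lam * t)            # decaying step size
        if y[i] * (X[i] @ w) < 1:        # point violates the margin
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:                            # margin satisfied: shrink only
            w = (1 - eta * lam) * w
    return w
```

The regularization weight `lam` plays the role of the margin/error tradeoff: a smaller value allows a larger `||w||` and hence a smaller margin.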
Being based on the structural risk minimization principle and a capacity concept with purely combinatorial definitions, the quality and complexity of the SVM solution do not depend directly on the dimensionality of the input space.

Manuscript received September 1, 1998; revised February 18, 1999. This work was carried out at the ESAT Laboratory and the Interdisciplinary Center of Neural Networks ICNN of the Katholieke Universiteit Leuven, Belgium, in the framework of the FWO project G.0262.97 Learning and Optimization: An Interdisciplinary Approach, the Belgian Programme on Interuniversity Poles of Attraction, initiated by the Belgian State, Prime Minister's Office for Science, Technology, and Culture (IUAP P4-02 & IUAP P4-24), and the Concerted Action Project MIPS (Modelbased Information Processing Systems) of the Flemish Community.

The authors are with the Department of Electrical Engineering, Katholieke Universiteit Leuven, ESAT-SISTA, Kardinaal Mercierlaan 94, B-3001 Leuven (Heverlee), Belgium. J. A. K. Suykens is also with the National Fund for Scientific Research FWO, Flanders.

Publisher Item Identifier S 1045-9227(99)05969-X.

However, in the case of an MLP-SVM, only the output weights of the MLP are found by solving the QP problem. The interconnection matrix is directly related to the training data points themselves, up to two additional constants. Hence the overall problem of finding the output weights together with these additional constants is in fact nonconvex. The number of hidden units follows from solving the QP problem and equals the number of support vectors.

In this paper we describe a modified support vector method for training an MLP with a given number of hidden units. An upper bound on the Vapnik–Chervonenkis (VC) dimension is iteratively minimized over the interconnection matrix and the bias vector of the hidden layer. The output weights are determined according to the support vector method.
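The capacity quantity being minimized can be sketched concretely. One common form of Vapnik's result bounds the VC dimension of a margin classifier by min(ceil(R² ||w||²), n) + 1, where R is the radius of the smallest ball containing the (hidden-layer) feature vectors and n is the feature dimension; the helper below is illustrative and not necessarily the exact expression the authors minimize:

```python
import numpy as np

def vc_bound(radius_sq, w_norm_sq, dim):
    """Illustrative VC-dimension bound for a margin classifier:
    h <= min(ceil(R^2 * ||w||^2), dim) + 1.
    Shrinking R^2 * ||w||^2 (i.e., enlarging the margin relative to
    the data radius) tightens the bound, independently of dim."""
    return min(int(np.ceil(radius_sq * w_norm_sq)), dim) + 1

def feature_radius_sq(Z):
    """Squared radius of a ball around the feature mean that contains
    all rows of Z (a simple stand-in for the minimal enclosing ball)."""
    c = Z.mean(axis=0)
    return float(np.max(np.sum((Z - c) ** 2, axis=1)))
```

Note that the first argument of `min` does not involve the input dimensionality, which is the sense in which the bound "does not depend directly on the dimensionality of the input space."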
We illustrate the method on a two-spiral benchmark problem. An advantage of this approach compared to backpropagation is that the generalization performance is optimized in terms of the upper bound on the VC dimension. In backpropagation one usually incorporates a regularization term (a norm on the interconnection weight vector, or weight decay) in order to obtain improved generalization performance, which is related to the bias-variance tradeoff [1].

For MLP-SVM's, Mercer's condition is not satisfied for all possible values of the hidden layer parameters, and the SVM theory is less developed for this type of kernel than, e.g., for RBF kernels, where additional links with regularization theory have been demonstrated [9]. The present method does not require the additional Mercer condition and could be applied to other activation functions, such as circular units [7]. On the other hand, the matrix in the QP subproblem is not guaranteed to be positive definite (positive definiteness is what guarantees that the QP solution is global and unique [3]), but the overall design problem is nonconvex anyway. While SVM's have been successfully applied to large-scale problems, the modified method is applicable to moderate-size problems, because all the weights of the hidden layer have to be estimated instead of only the two additional constants. Drawbacks of the proposed method are therefore its higher computational cost and the larger number of parameters in the hidden layer, compared to a standard SVM approach.

This paper is organized as follows. In Section II we review some basic facts about support vector machines for classification problems. In Section III we discuss the multilayer perceptron classifier with the modified support vector training method. In Section IV we give an example for a two-spiral classification problem.