Upper Bound on Pattern Storage in Feedforward Networks

Pramod L. Narasimha, Michael T. Manry and Francisco Maldonado

Abstract—Starting from the strict interpolation equations for multivariate polynomials, an upper bound is developed for the number of patterns that can be memorized by a nonlinear feedforward network. A straightforward proof by contradiction is presented for the upper bound. It is shown that the hidden activations do not have to be analytic. Networks, trained by conjugate gradient, are used to demonstrate the tightness of the bound for random patterns. Based upon the upper bound, small multilayer perceptron models are successfully demonstrated for large support vector machines.

I. INTRODUCTION

Pattern memorization in nonlinear networks has been studied for many decades. The number of patterns that can be memorized has been referred to as the information capacity [2] and the storage capacity [18]. Equating network outputs to desired outputs has been referred to as strict interpolation [5], [20], [7].

It is important to understand the pattern memorization capability of feedforward networks for several reasons. First, the capability to memorize is related to the ability to form arbitrary shapes in weight space. Second, if a network can successfully memorize many random patterns, we know that the training algorithm is powerful [16]. Third, some useful feedforward networks, such as Support Vector Machines (SVMs), memorize large numbers of training patterns [10], [11].

Upper bounds on the number of distinct patterns P that can be memorized by nonlinear feedforward networks are functions of the number of weights in the network, N_w, and the number of outputs, M. For example, Davis [5] has shown that for any P distinct, complex points there exists a unique degree-(P − 1) polynomial, with complex coefficients, that strictly interpolates (memorizes) all the points.
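Davis's result can be illustrated numerically. The following sketch (our illustration, not the paper's code; the choice of P = 6 and the use of real-valued points are assumptions for brevity) solves the Vandermonde system for the degree-(P − 1) polynomial through P distinct points and verifies that it strictly interpolates, i.e. memorizes, all of them.

```python
import numpy as np

# Sketch of strict interpolation in the sense of Davis [5]: for P distinct
# points there is a unique degree-(P - 1) polynomial through all of them.
rng = np.random.default_rng(0)
P = 6
x = rng.standard_normal(P)   # P distinct abscissas (distinct w.p. 1)
t = rng.standard_normal(P)   # arbitrary desired values

# Solve the Vandermonde system V c = t for the P polynomial coefficients.
V = np.vander(x, P)          # columns: x^(P-1), ..., x, 1
c = np.linalg.solve(V, t)

# The polynomial memorizes (strictly interpolates) all P points exactly.
assert np.allclose(np.polyval(c, x), t)
```

Because the Vandermonde matrix of distinct points is nonsingular, the solve always succeeds and the interpolation error is zero, which is exactly the memorization condition used throughout the paper.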
In other words, breaking up the complex quantities into separate real and imaginary parts, he has derived a bound for the M = 2 case. An upper bound on the number of hidden units in the Multilayer Perceptron (MLP) for the M = 1 case, derived by Elisseeff and Moisy [6], agrees with the bound of Davis. Suyari and Matsuba [21] have derived the storage capacity of neural networks with binary weights, using the minimum distance between the patterns. Cosnard et al. [4] have derived upper and lower bounds on the size of nets capable of computing arbitrary dichotomies. Ji and Psaltis [13] have derived upper and lower bounds for the information capacity of two-layer feedforward neural networks with binary interconnections, using an approach similar to that of Baum [3]. Moussaoui [1] and Ma and Ji [17] have pointed out that the information capacity is reflected in the number of weights of the network.

Unfortunately, most recent research on pattern memorization in feedforward networks focuses on the one-output case. In this paper, partially building upon the work of Davis [5], we investigate an upper bound for M ≥ 1 and arbitrary hidden unit activation functions. In section II, we introduce our notation. A straightforward proof of the upper bound is given in section III. An example which indicates the validity of the bound is presented in section IV. In section V, we use the upper bound to predict the size of MLPs that can mimic the training behavior of SVMs.

(Pramod L. Narasimha and Michael T. Manry are with the Department of Electrical Engineering, University of Texas at Arlington, Arlington, TX 76013, USA; email: pramod.narasimha@uta.edu, manry@uta.edu. Francisco Maldonado is with Williams Pyro, Inc., 200 Greenleaf Street, Fort Worth, Texas 76107; email: javier.maldonado@williams-pyro.com.)

II. NOTATION AND PRELIMINARIES
A. Notation

Let {x_p, t_p}, p = 1, ..., P, be the data set, where x_p ∈ R^N is the input vector, t_p ∈ R^M is the desired output vector, and P is the number of patterns. Let us consider a feedforward MLP having N inputs, one hidden layer with h nonlinear units, and an output layer with M linear units. For the p-th pattern, the j-th hidden unit's net function and activation are, respectively,

net_{pj} = \sum_{i=1}^{N+1} w_h(j,i) \cdot x_{pi}, \quad 1 \le p \le P, \ 1 \le j \le h    (1)

O_{pj} = f(net_{pj})    (2)

Here, the activation f(net) is a nonlinear function of the net function. The weight w_h(j,i) connects the i-th input to the j-th hidden unit. The threshold of the j-th hidden unit is represented by w_h(j, N+1) and is handled by fixing x_{p,N+1} to one. The k-th output for the p-th pattern is given by

y_{pk} = \sum_{i=1}^{N+1} w_{oi}(k,i) \cdot x_{pi} + \sum_{j=1}^{h} w_{oh}(k,j) \cdot O_{pj}    (3)

where 1 ≤ k ≤ M. For the p-th pattern, the N input values are x_{pi} (1 ≤ i ≤ N) and the M desired output values are t_{pk} (1 ≤ k ≤ M). The weights w_{oi} connect inputs to outputs, and the weights w_{oh} connect hidden units to outputs.

B. Review

A feedforward network is said to have memorized a data set if, for every pattern, the network outputs are exactly equal to the desired outputs. The storage capacity of a feedforward network is the number P of distinct input vectors that can be mapped, exactly, to the corresponding desired output vectors, resulting in zero error.

1-4244-1380-X/07/$25.00 ©2007 IEEE
Proceedings of International Joint Conference on Neural Networks, Orlando, Florida, USA, August 12-17, 2007
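The forward pass defined by equations (1)–(3) can be sketched as follows. This is our minimal illustration, not the authors' code; the sizes N = 4, h = 3, M = 2 and the logistic sigmoid for f are assumptions (the paper only requires f to be nonlinear).

```python
import numpy as np

# Forward pass per equations (1)-(3): N inputs, h nonlinear hidden units,
# M linear outputs, bypass weights from inputs to outputs, and thresholds
# handled by fixing the augmented input x_{p,N+1} = 1.
rng = np.random.default_rng(1)
N, h, M = 4, 3, 2                        # assumed example sizes
w_h  = rng.standard_normal((h, N + 1))   # w_h(j, i): input-to-hidden
w_oi = rng.standard_normal((M, N + 1))   # w_oi(k, i): input-to-output
w_oh = rng.standard_normal((M, h))       # w_oh(k, j): hidden-to-output

def forward(x):
    """Compute the M outputs y_pk for one pattern x of length N."""
    x_aug = np.append(x, 1.0)            # x_{p,N+1} = 1 carries the threshold
    net = w_h @ x_aug                    # equation (1)
    O = 1.0 / (1.0 + np.exp(-net))       # equation (2); f = sigmoid here
    return w_oi @ x_aug + w_oh @ O       # equation (3)

y = forward(rng.standard_normal(N))
assert y.shape == (M,)
```

Under the memorization definition of section II-B, such a network has stored the data set when forward(x_p) equals t_p exactly for all P patterns, i.e. when the training error is zero.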