Weight-decay regularization in Reproducing Kernel Hilbert Spaces by variable-basis schemes

GIORGIO GNECCO
Department of Computer and Information Science (DISI)
University of Genoa
Via Dodecaneso 35, 16146 Genova
ITALY
giorgio.gnecco@dist.unige.it

MARCELLO SANGUINETI
Department of Communications, Computer, and System Sciences (DIST)
University of Genoa
Via Opera Pia 13, 16145 Genova
ITALY
marcello@dist.unige.it

Abstract: The optimization problems associated with various regularization techniques for supervised learning from data (e.g., weight-decay and Tikhonov regularization) are described in the context of Reproducing Kernel Hilbert Spaces. Suboptimal solutions expressed by sparse kernel models with a given upper bound on the number of kernel computational units are investigated. Improvements of some estimates obtained in Comput. Manag. Sci., vol. 6, pp. 53-79, 2009 are derived. Relationships between sparseness and generalization are discussed.

Key–Words: Learning from data, regularization, weight decay, suboptimal solutions, rates of approximation.

1 Introduction

In supervised learning, an unknown input-output mapping has to be learned on the basis of a sample of input-output data [1]. The problem of approximating a function on the basis of a data sample $z = \{(x_i, y_i) \in X \times \mathbb{R},\ i = 1, \dots, m\}$ is often ill-posed [2, 3]. Regularization [4] can be used to cope with this drawback.

Among regularization techniques, weight decay (see, e.g., [5]) is a learning technique that penalizes large values of the parameters (weights) of the model to be learned. For linear regression problems, the performance of weight decay was theoretically investigated in [5], where the case of linearization of a nonlinear model was also considered. As to nonlinear models, a theoretical motivation of the generalization performance of certain neural networks trained through weight decay was given in [6], where binary classification problems were studied using tools from Statistical Learning Theory.

In this paper, we study the optimization problems associated with weight decay and other learning techniques. Each problem is formulated as the minimization of a regularized empirical error functional over a suitable hypothesis space. Then, we compare the solution provided to the learning problem by weight-decay regularization with the solution given by classical Tikhonov regularization and by a mixed regularization technique (i.e., weight decay combined with Tikhonov regularization). When one uses hypothesis spaces spanned by kernel functions implemented by computational units widely used in connectionistic models, the solution to the Tikhonov-regularized learning problem has the form of a linear combination of the m-tuple of kernel functions parameterized by the input data vector $x = (x_1, \dots, x_m)$. The coefficients of the linear combination can be obtained by solving a suitable linear system of equations, and this property can be exploited to develop learning algorithms. In order to simplify the analysis and emphasize the relationships between weight decay and Tikhonov regularization, also for the weight-decay learning problem and the mixed weight-decay/Tikhonov one we consider admissible solutions expressed as linear combinations of kernel functions parameterized by the input data vectors. For these problems, one can show [7] that the optimal solutions are also obtained by solving systems of linear equations.
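To make the linear-system property concrete, the following Python sketch illustrates how the coefficients of the kernel expansion could be computed for the Tikhonov and weight-decay formulations. It is our own minimal illustration, not code from the paper: the Gaussian kernel, the function names, the regularization parameter gamma, and the unnormalized weight-decay objective $\|Kc - y\|^2 + \gamma \|c\|^2$ are all assumptions made for the sake of the example.

```python
import numpy as np

def gaussian_kernel(x, y, width=1.0):
    # K(x, y) = exp(-||x - y||^2 / width); one common choice of kernel unit
    return np.exp(-np.linalg.norm(x - y) ** 2 / width)

def gram_matrix(X, kernel):
    # Entry (i, j) is K(x_i, x_j) for the m input data vectors
    m = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

def tikhonov_coefficients(K, y, gamma):
    # Tikhonov regularization in the RKHS: the optimal kernel expansion
    # f = sum_i c_i K(x_i, .) has coefficients solving (K + gamma*m*I) c = y
    m = K.shape[0]
    return np.linalg.solve(K + gamma * m * np.eye(m), y)

def weight_decay_coefficients(K, y, gamma):
    # Weight decay penalizes the squared norm of the coefficient vector c
    # itself; minimizing ||K c - y||^2 + gamma ||c||^2 over c gives the
    # normal equations (K^T K + gamma*I) c = K^T y
    m = K.shape[0]
    return np.linalg.solve(K.T @ K + gamma * np.eye(m), K.T @ y)

# Example with synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                  # m = 20 inputs in R^3
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)
K = gram_matrix(X, gaussian_kernel)
c_tik = tikhonov_coefficients(K, y, gamma=0.1)
c_wd = weight_decay_coefficients(K, y, gamma=0.1)
```

In both cases the model is determined by a single m-by-m linear solve, which is what makes these regularized problems algorithmically convenient for moderate m.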
For large data sets, the use of a number of computational units equal to the number m of data may lead to very complex models and so may be computationally infeasible. Moreover, practical applications of linear algorithms using m computational units …
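One heuristic way to obtain sparse suboptimal solutions of the kind mentioned in the abstract, with at most n < m kernel units, is a greedy selection of kernel columns. The sketch below is our own illustration under that assumption, not an algorithm from the paper: at each step it picks the kernel column most correlated with the current residual and refits a regularized least-squares problem on the selected columns.

```python
import numpy as np

def greedy_sparse_coefficients(K, y, gamma, n_units):
    # Greedy (matching-pursuit style) selection of at most n_units < m
    # kernel columns, followed by a regularized least-squares refit on the
    # selected columns; a heuristic sketch of a sparse kernel model with a
    # bounded number of units, not the paper's theoretical estimates.
    assert n_units >= 1
    selected, residual = [], y.copy()
    for _ in range(n_units):
        scores = np.abs(K.T @ residual)
        scores[selected] = -np.inf            # do not pick a column twice
        selected.append(int(np.argmax(scores)))
        K_s = K[:, selected]                  # m x |selected| submatrix
        c_s = np.linalg.solve(K_s.T @ K_s + gamma * np.eye(len(selected)),
                              K_s.T @ y)
        residual = y - K_s @ c_s
    return selected, c_s

# Usage, reusing K, y from the previous sketch:
# selected, c = greedy_sparse_coefficients(K, y, gamma=0.1, n_units=5)
```

The resulting model uses only the n selected kernel units, trading some empirical error for a substantially smaller linear system and a less complex model.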