A DC-Programming Algorithm for Kernel Selection

Andreas Argyriou  a.argyriou@cs.ucl.ac.uk
Department of Computer Science, University College London, Gower Street, London WC1E 6BT, UK

Raphael Hauser  raphael.hauser@comlab.ox.ac.uk
Oxford University Computing Laboratory, Wolfson Building, Parks Road, Oxford OX1 3QD, UK

Charles A. Micchelli
Department of Mathematics and Statistics, State University of New York, The University at Albany, 1400 Washington Avenue, Albany, NY 12222, USA

Massimiliano Pontil  m.pontil@cs.ucl.ac.uk
Department of Computer Science, University College London, Gower Street, London WC1E 6BT, UK

Abstract

We address the problem of learning a kernel for a given supervised learning task. Our approach consists in searching within the convex hull of a prescribed set of basic kernels for one which minimizes a convex regularization functional. A unique feature of this approach compared to others in the literature is that the number of basic kernels can be infinite; we only require that they are continuously parameterized. For example, the basic kernels could be isotropic Gaussians with variance in a prescribed interval, or even Gaussians parameterized by multiple continuous parameters. Our work builds upon a formulation involving a minimax optimization problem and a recently proposed greedy algorithm for learning the kernel. Although this optimization problem is not convex, it belongs to the larger class of DC (difference of convex functions) programs. Therefore, we apply recent results from DC optimization theory to create a new algorithm for learning the kernel. Our experimental results on benchmark data sets show that this algorithm outperforms a previously proposed method.

1. Introduction

An essential ingredient in a wide variety of machine learning algorithms is the choice of the kernel, and the performance of these algorithms is strongly affected by which kernel is used. Commonly used kernels are Gaussian or polynomial ones, but many other classes of kernels are possible. Recent interest has focused on the question of learning the kernel from a prescribed family of available kernels, K, which is often required to be convex. Generally, the method is to specify an objective function of the kernel and to optimize it over K. This task has been pursued from different perspectives; see (Argyriou et al., 2005; Bach et al., 2004; Lanckriet et al., 2004; Lin & Zhang, 2003; Micchelli & Pontil, 2005; Ong et al., 2003; Sonnenburg et al., 2006) and references therein.

An essential aspect of our perspective is that we consider the convex hull of a continuously parameterized family: for example, the family of Gaussians whose covariance is an arbitrary positive multiple of the identity matrix, or the family of polynomial kernels of arbitrary degree. This point of view avoids having to decide in advance which finite set of variances must be chosen to specify a finite set of kernels whose convex hull is then considered; see (Bach et al., 2004; Lanckriet et al., 2004; Lin & Zhang, 2003).
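To make the convex-hull construction concrete, the following is a minimal sketch (our illustration, not the paper's algorithm) of one element of the convex hull of isotropic Gaussian kernels with width in a prescribed interval. The function names and the particular widths and weights are assumptions chosen for the example; a kernel-learning method of the kind studied here would search over all such combinations rather than fix one.

```python
# Minimal sketch: a kernel in the convex hull of continuously
# parameterized basic kernels. The basic kernels are isotropic Gaussians
#   G_sigma(x, t) = exp(-||x - t||^2 / (2 sigma^2)),  sigma in [a, b],
# and a finite convex combination of them is one element of that
# (infinite) convex hull. All names and numbers below are illustrative.

import numpy as np

def gaussian_kernel(X, T, sigma):
    """Gram matrix of the isotropic Gaussian kernel with width sigma."""
    sq_dists = ((X[:, None, :] - T[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def convex_combination_kernel(X, T, sigmas, weights):
    """A kernel in the convex hull of the basic Gaussians {G_sigma}.

    The weights must be nonnegative and sum to one, so the result is
    again a valid (positive semidefinite) kernel.
    """
    weights = np.asarray(weights, dtype=float)
    assert np.all(weights >= 0) and np.isclose(weights.sum(), 1.0)
    K = np.zeros((X.shape[0], T.shape[0]))
    for sigma, w in zip(sigmas, weights):
        K += w * gaussian_kernel(X, T, sigma)
    return K

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((5, 3))
    # Three widths picked from a prescribed interval [0.5, 2.0]; a
    # learning algorithm would instead optimize over the whole interval.
    K = convex_combination_kernel(X, X, sigmas=[0.5, 1.0, 2.0],
                                  weights=[0.2, 0.5, 0.3])
    # The smallest eigenvalue is nonnegative up to rounding error,
    # confirming that the combination is itself a valid kernel.
    print(np.linalg.eigvalsh(K).min())
```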
Almost exclusively, Gaussians with isotropic covariance have been considered up to now; that is, the covariance is a multiple of the identity matrix. An important departure from previous work that we take in this paper is to consider the possibility that the covariance is a full matrix, although perhaps constrained appropriately. This leads us to a challenging optimization problem for choosing the covariance matrix as a