Learning general Gaussian kernel hyperparameters of SVMs using optimization on symmetric positive-definite matrices manifold

Hicham Laanaya a,b,*, Fahed Abdallah a, Hichem Snoussi c, Cédric Richard c

a Centre de Recherche de Royallieu, Lab. Heudiasyc, UMR CNRS 6599, BP 20529, 60205 Compiègne, France
b Faculté des Sciences Rabat, Université Mohammed V-Agdal, 4 Avenue Ibn Battouta, B.P. 1014 RP, Rabat, Morocco
c Institut Charles Delaunay (FRE CNRS 2848), Université de Technologie de Troyes, 10010 Troyes, France

Article history: Received 21 July 2010. Available online 24 May 2011. Communicated by Y. Ma.

Keywords: Kernel optimization; Support vector machines; General Gaussian kernel; Symmetric positive-definite matrices manifold

Abstract

We propose a new method for general Gaussian kernel hyperparameter optimization for support vector machines classification. The hyperparameters are constrained to lie on a differentiable manifold. The proposed optimization technique is based on a gradient-like descent algorithm adapted to the geometrical structure of the manifold of symmetric positive-definite matrices. We compare the performance of our approach with the classical support vector machine for classification and with other methods of the state of the art on toy data and on real-world data sets.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Support Vector Machine (SVM) is a promising pattern classification technique proposed by Vapnik (1995).
Unlike traditional methods, which minimize the empirical training error, SVM aims at minimizing an upper bound of the generalization error by maximizing the margin between the separating hyperplane and the data. This can be regarded as an approximate implementation of the Structural Risk Minimization principle. What makes SVM attractive is its ability to condense the information in the training data and to provide a sparse representation using a very small number of data points, called support vectors (SVs) (Girosi, 1998).

The key features of SVMs are the use of kernels, the absence of local minima, the sparseness of the solution, and the capacity control obtained by optimizing the margin (Cristianini and Shawe-Taylor, 2000). Nevertheless, an SVM-based method is unable to give accurate results in high-dimensional spaces when more than one dimension is noisy (Grandvalet and Canu, 2002; Weston et al., 2000). Another limitation of the support vector approach lies in the choice of the kernel and its eventual hyperparameters. Hyperparameter selection is in fact crucial to enhance the performance of an SVM classifier. Different works have addressed this problem with different aims: Gold and Sollich (2003), Grandvalet and Canu (2002), Lanckriet et al. (2004) and Weston et al. (2000) introduced methods for the feature selection problem using a Gaussian kernel, while Chen and Ye (2008), Lanckriet et al. (2004) and Luss and d'Aspremont (2008) learn the optimal kernel matrix, also called the Gram matrix, directly from the training data using semidefinite programming or an initial guess (a similarity matrix) of the kernel. These methods use similar optimization problems and compute the solution with gradient descent approaches. Note that the authors in (Lanckriet et al., 2004; Luss and d'Aspremont, 2008) estimate the kernel matrix simultaneously for the training and test examples, so the kernel function expression is not determined.
However, learning the kernel matrix directly is computationally demanding, as n(n + 1)/2 parameters must be learned and stored, where n is the number of examples in the database. Furthermore, a kernel matrix estimated on the given data set cannot be used directly to classify unseen examples.

In a different manner, and for the same classification problem, the methods proposed for feature selection learn the Gaussian kernel hyperparameter as a diagonal matrix Q of dimension d × d, where d is the number of features, and do not take into account the eventual relationships between features (as in feature extraction problems). We propose here a new method for learning the hyperparameters of a general Gaussian kernel of the form:

k_Q(x, y) = exp(−(1/2) (x − y)^T Q (x − y)),   (1)

where x, y ∈ R^d, and Q is a d × d symmetric positive-definite matrix to be adjusted in order to adequately answer a specified criterion.

* Corresponding author at: Faculté des Sciences Rabat, Université Mohammed V-Agdal, 4 Avenue Ibn Battouta, B.P. 1014 RP, Rabat, Morocco. Tel.: +212 6 64 73 18 00.
E-mail addresses: Hicham.Laanaya@hds.utc.fr, hicham.laanaya@gmail.com (H. Laanaya), Fahed.Abdallah@hds.utc.fr (F. Abdallah), Hichem.Snoussi@utt.fr (H. Snoussi), cedric.richard@unice.fr (C. Richard).
Pattern Recognition Letters 32 (2011) 1511–1515. doi:10.1016/j.patrec.2011.05.009
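As an illustrative sketch (not part of the paper), the general Gaussian kernel of Eq. (1) can be computed in a few lines of NumPy. The function name and the example matrices below are hypothetical; the sketch only contrasts a full symmetric positive-definite Q, which couples features, with a diagonal Q, which merely rescales each feature independently as in the feature-selection methods cited above.

```python
import numpy as np

def general_gaussian_kernel(x, y, Q):
    """General Gaussian kernel: k_Q(x, y) = exp(-0.5 (x - y)^T Q (x - y)).

    Q is assumed to be a d x d symmetric positive-definite matrix,
    so the quadratic form is nonnegative and k_Q(x, y) lies in (0, 1].
    """
    diff = x - y
    return np.exp(-0.5 * diff @ Q @ diff)

rng = np.random.default_rng(0)
x, y = rng.standard_normal(3), rng.standard_normal(3)

# Diagonal Q: one weight per feature, no cross-feature interaction.
Q_diag = np.diag([1.0, 2.0, 0.5])

# Full SPD Q: A A^T is positive semi-definite, adding 3I makes it
# positive definite, so off-diagonal terms couple the features.
A = rng.standard_normal((3, 3))
Q_full = A @ A.T + 3.0 * np.eye(3)

k_diag = general_gaussian_kernel(x, y, Q_diag)
k_full = general_gaussian_kernel(x, y, Q_full)
assert 0.0 < k_diag <= 1.0 and 0.0 < k_full <= 1.0
```

Any SPD parameterization such as A A^T + εI guarantees that the quadratic form stays valid; the paper's contribution, developed in the following sections, is to optimize Q directly on the SPD manifold rather than through such an ad hoc construction.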