Inferring Sparse Kernel Combinations and Relevance Vectors: An Application to Subcellular Localization of Proteins

Theodoros Damoulas†, Yiming Ying*, Mark A. Girolami† and Colin Campbell*

† Department of Computing Science, University of Glasgow, Sir Alwyn Williams Building, Lilybank Gardens, Glasgow G12 8QQ, Scotland, UK
{theo, girolami}@dcs.gla.ac.uk

* Department of Engineering Mathematics, University of Bristol, Queen's Building, University Walk, Bristol BS8 1TR, England, UK
{enxyy, C.Campbell}@bris.ac.uk

Abstract

In this paper, we introduce two new formulations for multi-class multi-kernel relevance vector machines (m-RVMs) that explicitly lead to sparse solutions, both in samples and in number of kernels. This enables their application to large-scale multi-feature multinomial classification problems where there is an abundance of training samples, classes and feature spaces. The proposed methods are based on an expectation-maximization (EM) framework employing a multinomial probit likelihood and explicit pruning of non-relevant training samples. We demonstrate the methods on a low-dimensional artificial dataset. We then demonstrate the accuracy and sparsity of the method when applied to the challenging bioinformatics task of predicting protein subcellular localization.

1. Introduction

Recently, multi-kernel learning (MKL) methods have attracted great interest in the machine learning community [10, 6, 13, 14, 11]. Since many supervised learning tasks in biology involve heterogeneous data, MKL methods have been successfully applied to many important bioinformatics problems [9, 12, 2], often providing state-of-the-art performance. The intuition behind these multi-kernel methods is to represent a set of heterogeneous features via different types of kernels and to combine the resulting kernels in a convex combination; this is illustrated in Figure 1.
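The convex combination of heterogeneous kernels can be sketched as follows. This is a minimal illustration, not the paper's method: the kernel choices (an RBF kernel and a linear kernel), the function names, the weights β and the random data are all illustrative assumptions, and NumPy is assumed to be available.

```python
import numpy as np

def rbf_kernel(X, gamma):
    # Pairwise squared distances turned into an RBF Gram matrix.
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

def linear_kernel(X):
    # Plain inner-product kernel on a second feature space.
    return X @ X.T

def composite_kernel(kernels, beta):
    # Convex combination of Gram matrices: beta >= 0, sum(beta) == 1.
    beta = np.asarray(beta, dtype=float)
    assert np.all(beta >= 0) and np.isclose(beta.sum(), 1.0)
    return sum(b * K for b, K in zip(beta, kernels))

# Two heterogeneous feature spaces describing the same 5 objects,
# e.g. sequence-derived and expression-derived protein features.
rng = np.random.default_rng(0)
X1 = rng.normal(size=(5, 3))
X2 = rng.normal(size=(5, 8))

Ks = [rbf_kernel(X1, gamma=0.5), linear_kernel(X2)]
K = composite_kernel(Ks, beta=[0.7, 0.3])
print(K.shape)  # (5, 5)
```

Because each base kernel is positive semi-definite and the weights are non-negative, the composite Gram matrix K is itself a valid kernel, which is what makes the convex-combination formulation convenient.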
In other words, kernel functions k, with corresponding kernel parameters θ, represent the similarities between objects x_n based on their feature vectors:

k(x_i, x_j) = 〈Φ(x_i), Φ(x_j)〉

Learning the kernel combination parameters β is therefore an important component of the learning problem. Most MKL research has been done within the popular framework of support vector machines (SVMs), with progress concentrated on finding computationally efficient algorithms via improved optimization routines [15, 20]. Such methods provide sparse solutions in samples and kernels, due to the optimization over hyperplane normal parameters w and kernel combination parameters β, but they inherit the drawbacks of the non-probabilistic and binary nature of SVMs.

Figure 1. The intuition for MKL: from a heterogeneous multitude of feature spaces, to a common metric and finally to a composite space.

In the Bayesian paradigm, the functional form analogous to SVMs is the relevance vector machine (RVM) [18], which employs sparse Bayesian learning via an appropriate prior formulation. Maximization of the marginal likelihood, a type-II maximum likelihood (ML) expression, gives sparse solutions which utilize only a subset of the basis functions: the relevance vectors. Compared to an SVM, there are rel-