Protein Function Prediction by Integrating Multiple Kernels ∗

Guoxian Yu 1, Huzefa Rangwala 2, Carlotta Domeniconi 2, Guoji Zhang 1, Zili Zhang 3
1 School of Computer Sci. and Eng., South China University of Technology, Guangzhou, China
2 Department of Computer Science, George Mason University, VA, USA
3 School of Computer and Information Science, Southwest University, Chongqing, China
1 guoxian85@gmail.com, 2 {rangwala, carlotta}@cs.gmu.edu

Abstract

Determining protein function constitutes an exercise in integrating information derived from several heterogeneous high-throughput experiments. To utilize the information spread across multiple sources in a combined fashion, these data sources are transformed into kernels. Several protein function prediction methods follow a two-phased approach: they first optimize the weights on individual kernels to produce a composite kernel, and then train a classifier on the composite kernel. As such, these methods result in an optimal composite kernel, but not necessarily in an optimal classifier. On the other hand, some methods optimize the loss of binary classifiers, and learn weights for the different kernels iteratively. A protein has multiple functions, and each function can be viewed as a label. These methods solve the problem of optimizing weights on the input kernels for each of the labels. This is computationally expensive and ignores inter-label correlations. In this paper, we propose a method called Protein Function Prediction by Integrating Multiple Kernels (ProMK). ProMK iteratively optimizes the phases of learning optimal weights and reducing the empirical loss of a multi-label classifier for each of the labels simultaneously, using a combined objective function. ProMK can assign larger weights to smooth kernels and downgrade the weights on noisy kernels. We evaluate the ability of ProMK to predict the function of proteins using several standard benchmarks.
We show that our approach performs better than previously proposed protein function prediction approaches that integrate data from multiple networks, and multi-label multiple kernel learning methods.

∗ This work is partially supported by grants from NSF IIS (0905117, 1252318), NSFC (61070090, 61003174) and China Scholarship Council.

1 Introduction

Understanding biological mechanisms constitutes an exercise in integrating information derived from several heterogeneous high-throughput experiments. In modern day biology, for a given protein, the different forms of collected data can be its sequence (linear chain of amino acids), its three-dimensional structure, various interactions (e.g., protein-protein interactions) and gene co-expression. Determining the function of a protein using experimental approaches is time consuming and expensive. As such, several computational approaches have been proposed to predict the function of a protein by integrating different sources of available data and have shown superior empirical performance in comparison to training a protein function prediction model only on one of the data sources [Lanckriet et al., 2004; Mostafavi and Morris, 2010].

Several protein function prediction approaches involve representing different data sources as individual kernels (or graphs) and integrating the different kernels within a multiple kernel learning framework [Lanckriet et al., 2004]. Each data source is represented by a kernel function K that measures the pairwise similarities between proteins. K also captures the underlying biological complexity associated with the data. Multiple kernels are integrated by finding optimal weights within a semi-definite programming framework [Lanckriet et al., 2004]. Tsuda et al. [Tsuda et al., 2005] determine the optimal combination of networks and predictions by taking advantage of the dual problem. Mostafavi et al.
[Mostafavi et al., 2008] construct the optimal composite graph by solving a linear regression problem. Alternatively, another set of approaches uses classifier ensembles to integrate the predictions from models trained on individual sources [Cesa-Bianchi et al., 2012; Yu et al., 2012]. In this paper, we focus on protein function prediction by integrating multiple kernels.

Proteins are multi-functional and each function can be viewed as a label. Therefore, the protein function prediction problem can be viewed as a multi-label learning problem [Tsoumakas et al., 2010]. The methods described above [Tsuda et al., 2005; Mostafavi et al., 2008] divide the multi-label learning problem into multiple binary classification problems and ignore label correlations, which are known to be beneficial for multi-label classification [Tsoumakas et al., 2010]. To make use of inter-label dependencies, Mostafavi et al. [Mostafavi and Morris, 2010] proposed an approach called Simultaneous Weights (SW). SW first determines the optimal combination of weights by considering a group of functions instead of a single one, and then trains multiple