1162 IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. 59, NO. 4, APRIL 2012

A Weighted Power Framework for Integrating Multisource Information: Gene Function Prediction in Yeast

Shubhra Sankar Ray, Sanghamitra Bandyopadhyay, Senior Member, IEEE, and Sankar K. Pal, Fellow, IEEE

Abstract—Predicting the functions of unannotated genes is one of the major challenges of biological investigation. In this study, we propose a weighted power scoring framework, called weighted power biological score (WPBS), for combining different biological data sources and predicting the function of some of the unclassified yeast Saccharomyces cerevisiae genes. The relative power and weight coefficients of the different data sources in the proposed score are estimated systematically by utilizing functional annotations [yeast Gene Ontology (GO)-Slim: Process] of classified genes, available from the Saccharomyces Genome Database. Genes are then clustered by applying the k-medoids algorithm on WPBS, and functional categories of 334 unclassified genes are predicted using a P-value cutoff of 1 × 10⁻⁵. The WPBS is available online at http://www.isical.ac.in/shubhra/WPBS/WPBS.html, where one can download WPBS, related files, and MATLAB code to predict functions of unclassified genes.

Index Terms—Combinatorial optimization, gene expression, phenotypic profile, protein sequence, transitive homology.

I. BACKGROUND

The availability of high-throughput biological data, such as phenotypic profiles [1], gene expression microarrays [2], protein sequences [3], Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway [4], and protein–protein interaction data [5], has opened a new direction in genomic analysis and function prediction of unclassified genes by combining multisource as well as multiscale information from these biological data sources.
A single data source often lacks the accuracy needed for reliable gene function prediction; accuracy can be improved by integrating different data sources in an efficient manner. Predicting functions of unclassified yeast genes is an important task in biological research because yeast is considered a model eukaryotic organism. The Saccharomyces Genome Database (SGD) [6] and Munich Information for Protein Sequences (MIPS) [7] list 6069 and 6130 genes, respectively, for yeast Saccharomyces cerevisiae, of which 4387 and 4737 genes, respectively, are classified into some biological process; the remaining genes are unclassified. Out of the 1682 and 1393 unclassified genes in SGD and MIPS, 802 and 240 genes, respectively, are either pseudogenes or dubious open reading frames (ORFs). Hence, the number of unclassified genes, excluding pseudogenes and dubious ORFs, is 880 and 1153 in SGD and MIPS, respectively. Functional prediction of these genes may also help in classifying human genes with unknown functions.

Manuscript received July 22, 2011; revised December 27, 2011; accepted January 18, 2012. Date of publication February 3, 2012; date of current version March 21, 2012. Asterisk indicates corresponding author.
S. S. Ray is with the Center for Soft Computing Research: A National Facility, Indian Statistical Institute, Kolkata 700108, India and also with the Machine Intelligence Unit, Indian Statistical Institute, Kolkata 700108, India (e-mail: shubhra@isical.ac.in).
S. Bandyopadhyay is with the Machine Intelligence Unit, Indian Statistical Institute, Kolkata 700108, India (e-mail: sanghami@isical.ac.in).
S. K. Pal is with the Center for Soft Computing Research: A National Facility, Indian Statistical Institute, Kolkata 700108, India (e-mail: sankar@isical.ac.in).
Digital Object Identifier 10.1109/TBME.2012.2186689

Mering et al.
[8] first developed quantitative methods to measure and predict functional relationships among genes by benchmarking and then integrating information from different data sources. In [9], proteins are grouped by correlated evolution [10], correlated gene expression [2], and patterns of domain fusion [11] to determine functional relationships among the 6217 proteins of the yeast Saccharomyces cerevisiae. Troyanskaya et al. [12] integrated data sources using a Bayesian network approach and predicted functional modules with a clustering algorithm based on the principle of the K-nearest neighbor (KNN) algorithm. Interacting networks are predicted in [13], which identifies not only highly interacting and functionally connected genes but also those that are sparsely connected with others. Lee et al. [14] derived log likelihood scores from the various datasets, weighted them with a rank-order dependent weighting scheme, and added them to find a combined similarity using the Bayesian Score. Our previous work [15] (Ray et al., 2009) focuses on integrating multiscale information from data sources in a linear combination style through multiple free parameters. Functional categories of 12 unclassified yeast genes are also predicted in that study.

However, the performance of our previous integration method [15] can be improved by incorporating additional free parameters to estimate the relative powers of the individual pieces of information obtained from different data sources. In this regard, we present a new weighted power scoring framework, called weighted power biological score (WPBS), where, besides the existing linear weights, we incorporate additional free parameters involving power estimates of the positive predictive values (PPV) obtained from different data sources, namely, phenotypic profiles, cDNA microarray expression, KEGG pathway information, protein similarity through transitive homologues, and protein–protein interaction information.

II.
PROPOSED APPROACH FOR MULTISCALE DATA INTEGRATION

The main steps of our methodology involve: 1) extraction of pairwise similarity of yeast Saccharomyces cerevisiae
0018-9294/$31.00 © 2012 IEEE