1162 IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. 59, NO. 4, APRIL 2012
A Weighted Power Framework for Integrating
Multisource Information: Gene Function
Prediction in Yeast
Shubhra Sankar Ray∗, Sanghamitra Bandyopadhyay, Senior Member, IEEE, and Sankar K. Pal, Fellow, IEEE
Abstract—Predicting the functions of unannotated genes is one
of the major challenges of biological investigation. In this study,
we propose a weighted power scoring framework, called weighted
power biological score (WPBS), for combining different biological
data sources and predicting the function of some of the unclas-
sified yeast Saccharomyces cerevisiae genes. The relative power
and weight coefficients of different data sources, in the proposed
score, are estimated systematically by utilizing functional anno-
tations [yeast Gene Ontology (GO)-Slim: Process] of classified
genes, available from the Saccharomyces Genome Database. Genes
are then clustered by applying the k-medoids algorithm to WPBS, and
functional categories of 334 unclassified genes are predicted using
a P-value cutoff of $1 \times 10^{-5}$. The WPBS is available online at
http://www.isical.ac.in/~shubhra/WPBS/WPBS.html, where one
can download WPBS, related files, and a MATLAB code to predict
functions of unclassified genes.
Index Terms—Combinatorial optimization, gene expression,
phenotypic profile, protein sequence, transitive homology.
I. BACKGROUND
THE availability of high-throughput biological data, such
as phenotypic profiles [1], gene expression microarrays
[2], protein sequences [3], Kyoto Encyclopedia of Genes and
Genomes (KEGG) pathway [4], and protein–protein interaction
data [5] has opened a new direction in genomic analysis and
function prediction of unclassified genes by combining multi-
source as well as multiscale information from these biological
data sources. A single data source often lacks the accuracy
needed for reliable gene function prediction, and this
can be improved by integrating different data sources efficiently.
Predicting functions of unclassified yeast genes
is an important task in biological research, as yeast is considered
a model eukaryotic organism. According to the Saccharomyces
Genome Database (SGD) [6] and Munich Information for Pro-
tein Sequences (MIPS) [7], there are 6069 and 6130 genes for
yeast Saccharomyces cerevisiae, of which 4387 and 4737 genes,
respectively, are classified into some biological process, and the
remaining genes are unclassified. Out of 1682 and 1393 unclassified
genes in SGD and MIPS, 802 and 240 genes, respectively,
are either pseudogenes or dubious open reading frames (ORFs).
Hence, the number of unclassified genes, without pseudogenes
or dubious open reading frames, is 880 and 1153 in SGD and
MIPS, respectively. Functional prediction of these genes may
also help in classifying human genes with unknown functions.
Manuscript received July 22, 2011; revised December 27, 2011; accepted
January 18, 2012. Date of publication February 3, 2012; date of current version
March 21, 2012. Asterisk indicates corresponding author.
∗S. S. Ray is with the Center for Soft Computing Research: A National
Facility, Indian Statistical Institute, Kolkata 700108, India and also with the
Machine Intelligence Unit, Indian Statistical Institute, Kolkata 700108, India
(e-mail: shubhra@isical.ac.in).
S. Bandyopadhyay is with the Machine Intelligence Unit, Indian Statistical
Institute, Kolkata 700108, India (e-mail: sanghami@isical.ac.in).
S. K. Pal is with the Center for Soft Computing Research: A National Facility,
Indian Statistical Institute, Kolkata 700108, India (e-mail: sankar@isical.ac.in).
Digital Object Identifier 10.1109/TBME.2012.2186689
Mering et al. [8] first developed quantitative methods to measure
and predict functional relationships among genes by
benchmarking and then integrating information from different
data sources. In [9], proteins are grouped by correlated
evolution [10], correlated gene expression [2], and patterns
of domain fusion [11] to determine functional relationships
among the 6217 proteins of the yeast Saccharomyces cerevisiae.
Troyanskaya et al. [12] integrated data sources using a Bayesian
network approach and predicted functional modules with a
clustering algorithm based on the K-nearest neighbor (KNN)
principle. Interacting networks are predicted in [13],
which identifies not only highly interacting and functionally
connected genes but also those that are sparsely connected
with others. Lee et al. [14] derived log-likelihood scores from
the various datasets, weighted them with a rank-order-dependent
weighting scheme, and added them to obtain a combined similarity
using the Bayesian score. Our previous work, Ray et al. [15],
focuses on integrating multiscale information from
data sources as a linear combination with multiple free
parameters. Functional categories of 12 unclassified yeast genes
are also predicted in that study.
However, the performance of our previous integration method
[15] can be improved by incorporating additional free parameters
that estimate the relative powers of the individual information
obtained from different data sources. In this regard, we present a
new weighted power scoring framework, called weighted power
biological score (WPBS), where, besides the existing linear
weights, we incorporate new free parameters involving
power estimates of positive predictive values (PPV), obtained
from different data sources, namely, phenotypic profiles, cDNA
microarray expression, KEGG pathway information, protein
similarity through transitive homologues, and protein–protein
interaction information.
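To make the idea of combining linear weights with power parameters concrete, the following is a minimal sketch in Python. The functional form used here, a sum over data sources of weight times PPV raised to a power, is an assumption made for illustration only; it is not the WPBS definition, which this section has not yet stated. The function name `weighted_power_score`, the source labels, and all numeric values are hypothetical.

```python
from typing import Mapping


def weighted_power_score(
    ppv: Mapping[str, float],
    weights: Mapping[str, float],
    powers: Mapping[str, float],
) -> float:
    """Combine per-source PPV values for one gene pair into a single score.

    ASSUMED form (illustrative, not the paper's definition):
        score = sum over sources i of  weights[i] * ppv[i] ** powers[i]
    """
    total = 0.0
    for source, value in ppv.items():
        # A positive predictive value is a proportion, so it must lie in [0, 1].
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"PPV for {source!r} must be in [0, 1], got {value}")
        total += weights[source] * value ** powers[source]
    return total


# Hypothetical PPV values for a single gene pair from two data sources.
example_ppv = {"phenotype": 0.5, "expression": 0.25}

# With all powers equal to 1, the score reduces to a plain linear combination.
linear = weighted_power_score(
    example_ppv,
    weights={"phenotype": 2.0, "expression": 1.0},
    powers={"phenotype": 1.0, "expression": 1.0},
)

# A power above 1 shrinks the contribution of a PPV below 1.
powered = weighted_power_score(
    example_ppv,
    weights={"phenotype": 2.0, "expression": 1.0},
    powers={"phenotype": 2.0, "expression": 1.0},
)
```

Under this assumed form, setting every power to 1 recovers a purely linear weighting, so the power parameters act as extra degrees of freedom on top of the weights.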
II. PROPOSED APPROACH FOR MULTISCALE
DATA INTEGRATION
The main steps of our methodology involve: 1) extraction
of pairwise similarity of yeast Saccharomyces cerevisiae