Generating GO Slim Using Relational Database Management Systems to
Support Proteomics Analysis
Getiria Onsongo [1], Hongwei Xie [2], Timothy J. Griffin [2], John Carlis [1]
[1] Dept of Computer Science and Engineering,
University of Minnesota, 200 Union St SE, Minneapolis, MN 55455.
[2] Dept of Biochemistry, Molecular Biology and Biophysics,
University of Minnesota, 321 Church St. Minneapolis, MN 55455
Abstract
The Gene Ontology Consortium built the Gene Ontology
database (GO) to address the need for a common standard
in naming genes and gene products. Using different names
for the same concept, or the same name for different concepts,
makes it effectively impossible for humans and computers
alike to analyze biological processes across different
organisms. The consortium addresses this need by defining
terms for categorizing genes and gene products. A conven-
tion in GO is that each gene or gene product is annotated to
the most specific GO term in the GO database. It is, how-
ever, also useful for researchers to be able to group genes
or gene products into broad biological categories that give
a higher-level view of their function when analyzing results
of an experiment. A GO Slim is a subset of the GO ontology
that provides such a higher-level view of functions. Existing
GO Slim generation tools have two important limitations:
programming language dependence, and an inability to dynamically
generate a GO Slim during analysis. We have extended
the relational database engine to dynamically generate
a GO Slim, overcoming these limitations. Using this
extension, we have developed a tool (DynamicGOSlim) that
dynamically generates a GO Slim and uses the generated
GO Slim to categorize genes or gene products. This tool is
being used in an ongoing proteomics project aimed at iden-
tifying possible oral cancer biomarkers in saliva.
1. Introduction
The discovery nature of biological science naturally
leads to scientists naming what they find. However, giving
different names to what turns out to be the same concept and
giving different concepts the same name impedes science,
making it effectively impossible for humans and computers
alike to analyze biological concepts within and especially
across different organisms. The Gene Ontology Consor-
tium was formed to help reduce this babel, specifically to
“produce a dynamic, controlled vocabulary that can be applied
to all eukaryotes even as knowledge of gene and protein
roles in cells is accumulating and changing” [1]. The
consortium has successfully encouraged the disciplined use
of a common language by establishing, by consensus, a re-
stricted vocabulary, making it publicly available in the Gene
Ontology database, GO, and providing mechanisms for its
periodic update. Now it is common for GO terms to be used
in the research literature and public databases [2], [5].
GO consists of genes and gene products plus certain con-
cepts, called terms, associated with them, and, in addition,
other data that is not relevant here. GO organizes terms
and parent-child relationships between terms into three sep-
arate ontologies for biological processes, molecular func-
tions and cellular components. Each ontology forms a di-
rected acyclic graph, DAG, with each node being a term and
each parent-child relationship being a directed arc between
distinct nodes. In GO each child term is a more specific pro-
cess, function or component than each of its parent terms.
An association connects a gene or gene product with the
most specific possible term, and implicitly applies to the
term's ancestors. Collectively, the genes and gene products
associated with a term are called its annotation. Figure 1
shows a small portion of GO, with terms appearing inside
rectangles, genes or gene products associated with a term appearing
inside ellipses attached to its rectangle, and parent-child
term relationships appearing as arrows.
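The structure described above can be illustrated with a minimal Python sketch. It assumes a toy ontology whose term names, gene names, and helper functions (`PARENTS`, `DIRECT`, `ancestors`, `annotations`) are all hypothetical and are not taken from GO or from Figure 1. The sketch only shows how an association with the most specific term implicitly extends to that term's ancestors in a DAG; it is not the method used by DynamicGOSlim.

```python
from collections import defaultdict

# Hypothetical toy ontology: child term -> list of parent terms.
# The names are illustrative only and do not come from GO or Figure 1.
PARENTS = {
    "root": [],
    "metabolism": ["root"],
    "transport": ["root"],
    "lipid metabolism": ["metabolism"],
    # A term may have more than one parent, so this is a DAG, not a tree.
    "lipid transport": ["transport", "lipid metabolism"],
}

# Hypothetical direct associations: gene product -> most specific term.
DIRECT = {
    "geneA": "lipid transport",
    "geneB": "lipid metabolism",
    "geneC": "transport",
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent arcs."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def annotations(direct=DIRECT, parents=PARENTS):
    """Map each term to the gene products associated with it,
    either directly or implicitly through a descendant term."""
    result = defaultdict(set)
    for gene, term in direct.items():
        for t in {term} | ancestors(term, parents):
            result[t].add(gene)
    return dict(result)


if __name__ == "__main__":
    for term, genes in sorted(annotations().items()):
        print(term, sorted(genes))
```

In this toy data, `geneA` is associated only with "lipid transport", yet it also counts toward "transport", "lipid metabolism", "metabolism", and "root", because those terms are ancestors of its most specific term.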
1.1. GO Slim
Tools to produce variants of GO called GO Slims were
developed because, for some tasks, such as analyzing the
results of an experiment, two characteristics of GO make
it less than ideal. First, users may be interested in only a
small portion of the entire database and masses of irrelevant
21st IEEE International Symposium on Computer-Based Medical Systems
1063-7125/08 $25.00 © 2008 IEEE
DOI 10.1109/CBMS.2008.77