PROGRAM DESCRIPTION
Strategies for Biological
Annotation of Mammalian
Systems: Implementing Gene
Ontologies in Mouse Genome
Informatics
David P. Hill,1 Allan P. Davis,
Joel E. Richardson, John P. Corradi,
Martin Ringwald, Janan T. Eppig,
and Judith A. Blake
The Jackson Laboratory, 600 Main Street,
Bar Harbor, Maine 04609
Introduction
The nature of biology is informational. Instructions con-
tained in the genetic code are replicated, transcribed, and
translated into molecules that carry out the biological functions
of the gene products. Over time, this information
is subjected to evolutionary pressures that lead to diversity
and speciation, but also to a conservation of functions, pro-
cesses, and structures that are biologically useful.
Bioinformatics curates and integrates biological information
for computational purposes. One requirement for this is
the development of dictionaries of standardized terms, linked by
logical relationships, that are used consistently across
databases.
The Gene Ontology (GO) project was initiated when three
model-organism databases, the Drosophila genome database
(FlyBase Consortium, 1999), the Saccharomyces genome da-
tabase (SGD; Ball et al., 2000), and the mouse genome infor-
matics databases (MGD and GXD, hereafter referred to
jointly as MGI; Blake et al., 2000; Ringwald et al., 2000),
realized the need for a common annotation language to describe
conserved biological principles. This common need resulted
in the development of a shared, defined set of terms
that allows querying by the term and recovers the genes or
gene products associated with that term within multiple
(model organism) species. Curators from each of the model-
organism databases contribute to the structure and content
of the ontologies, allowing for the cross-species coverage of
the resource.
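
As a rough illustration of querying by a shared term, the sketch below pools annotations from several databases and recovers the gene products attached to one term across species. The `Annotation` record, the `genes_for_term` function, and the gene symbols are hypothetical placeholders chosen for this example; they are not the GO or MGI data model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    database: str  # contributing model-organism database
    gene: str      # placeholder gene symbol, not real data
    term: str      # term from the shared vocabulary


# Hypothetical annotations, for illustration only.
ANNOTATIONS = [
    Annotation("MGI", "mouse_gene_1", "signal transduction"),
    Annotation("FlyBase", "fly_gene_1", "signal transduction"),
    Annotation("SGD", "yeast_gene_1", "pyrimidine metabolism"),
    Annotation("MGI", "mouse_gene_2", "pyrimidine metabolism"),
]


def genes_for_term(annotations, term):
    """Group the genes annotated to `term` by contributing database."""
    result = defaultdict(list)
    for ann in annotations:
        if ann.term == term:
            result[ann.database].append(ann.gene)
    return dict(result)


print(genes_for_term(ANNOTATIONS, "signal transduction"))
```

Because every database uses the same term string, a single lookup spans all contributing organisms without any per-database translation step.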
The collection of annotations that are contributed by each
of the model-organism databases is compiled into a common
data resource that can be used to explore precise terms across
multiple organisms (Ashburner et al., 2000). In addition,
each model-organism database has its own unique represen-
tation of the GO. Here we detail the strategies used at MGI
to curate and use the GO for the annotation of mouse biology.
The Gene Ontologies
The GO is composed of structured vocabularies of terms.
Each term is precisely defined so that its meaning is
understood by all curators. Precise definitions are essential
to ensure consistent annotation when multiple curators are
annotating genes.
The GO is divided into three categories that reflect con-
served aspects of biology: molecular function, biological pro-
cess, and cellular component. The vocabularies contain a
structured hierarchy of terms designed to describe what the
gene product does and where the gene product is located in
the cell. Gene products may be annotated to many terms from
each ontology, since a gene may have more than one function,
be involved in a variety of processes, or carry out its function
in a variety of subcellular locations.
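
One way to picture a gene product carrying several terms from each of the three ontologies is the minimal sketch below. The `GeneProductAnnotations` class and the gene name `example_gene` are assumptions made for illustration, and pairing these particular terms with one gene product is arbitrary rather than a real annotation. The term strings themselves are examples quoted in this article.

```python
from collections import defaultdict

# The three GO categories named in the text.
ONTOLOGIES = ("molecular function", "biological process", "cellular component")


class GeneProductAnnotations:
    """Hypothetical container: one gene product, many terms per ontology."""

    def __init__(self, name):
        self.name = name
        self._terms = defaultdict(set)

    def annotate(self, ontology, term):
        # Reject categories that are not one of the three ontologies.
        if ontology not in ONTOLOGIES:
            raise ValueError(f"unknown ontology: {ontology!r}")
        self._terms[ontology].add(term)

    def terms(self, ontology):
        # Sorted list of terms recorded for the given ontology.
        return sorted(self._terms.get(ontology, ()))


gp = GeneProductAnnotations("example_gene")  # placeholder name
gp.annotate("molecular function", "enzyme")
gp.annotate("molecular function", "mRNA cap binding")
gp.annotate("biological process", "signal transduction")
gp.annotate("cellular component", "plasma membrane")
gp.annotate("cellular component", "transcription factor complex")

for ontology in ONTOLOGIES:
    print(ontology, "->", gp.terms(ontology))
```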
The Molecular Function Ontology describes the fundamen-
tal biochemical function of a gene product without regard to
its biological context (Fig. 1a). Examples of molecular function
include general terms such as “enzyme” and more specific
terms such as “mRNA cap binding.” In some cases, such
as enzyme, the term names the entity that carries out an
action but is meant to describe the action itself: enzyme describes
the function of catalysis. Enzyme was chosen for the vocabulary because
biologists are likely to use the word enzyme to describe the function of the molecule. In
some instances, it is simply more convenient to use the noun
that carries out the function rather than a complex term to
describe the function itself, for example, RNA guanyltransferase.
In any case, a gene product is annotated to these terms on the basis of
the actual function that the term describes.
The Biological Process Ontology describes a gene product’s
global biological objectives (Fig. 1b). Processes may or may
not be well defined biochemically, but they have a biological
context. Process terms are not meant to describe a pathway.
Process terms, like other vocabulary terms, vary in their
level of specificity. Some terms such as “signal transduction”
describe general processes and others such as “pyrimidine
metabolism” describe more specific processes. The structured
nature of the vocabulary allows for relationships between the
different levels of description of a process (see below). For
example, in Fig. 1b, the process of splicing is a part of the
more general process of transcription.
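
The relationship between levels of description can be sketched as a small graph walk. In the sketch below, the only relationship taken from the text is that splicing is a part of transcription. The `ancestors` and `annotated_to` functions, the gene names, and the idea that a query on a general term should also return genes annotated to its more specific parts are assumptions made for illustration, not a description of the GO or MGI implementation.

```python
# Child term -> list of more general terms it is a part of.
# Only the splicing/transcription link comes from the text (Fig. 1b).
PART_OF = {"splicing": ["transcription"]}


def ancestors(term, parents):
    """Return every more general term reachable from `term`."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def annotated_to(gene_terms, query, parents):
    """Genes annotated to `query` directly or via a more specific term."""
    return sorted(
        gene
        for gene, terms in gene_terms.items()
        if any(t == query or query in ancestors(t, parents) for t in terms)
    )


# Hypothetical annotations, for illustration only.
GENE_TERMS = {
    "gene_a": {"splicing"},
    "gene_b": {"transcription"},
    "gene_c": {"signal transduction"},
}

print(annotated_to(GENE_TERMS, "transcription", PART_OF))
```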
The Cellular Component Ontology describes the subcellu-
lar component where a gene product is found (Fig. 1c). The
vocabulary contains not only classical cell biology terms such
as “plasma membrane,” but also terms that describe macro-
molecular structures such as “transcription factor complex.”
1 To whom correspondence should be addressed. Telephone: (207)
288-6430. E-mail: dph@informatics.jax.org.
SPECIAL FEATURE
121 0888-7543/01 $35.00 Genomics 74, 121–128 (2001)
Copyright © 2001 by Academic Press
doi:10.1006/geno.2001.6513
All rights of reproduction in any form reserved.