Assigning New GO Annotations to Protein Data Bank Sequences by Combining
Structure and Sequence Homology
Julia V. Ponomarenko,¹,²* Philip E. Bourne,¹,³ and Ilya N. Shindyalov¹
¹ San Diego Supercomputer Center, University of California, San Diego, La Jolla, California
² Institute of Cytology and Genetics, SB RAS, Novosibirsk, Russia
³ Department of Pharmacology, University of California, San Diego, La Jolla, California
ABSTRACT Accompanying the discovery of an increasing number of proteins,
there is the need to provide functional annotation that is both highly
accurate and consistent. The Gene Ontology™ (GO) provides consistent
annotation in a computer-readable and usable form; hence, GO annotation
(GOA) has been assigned to a large number of protein sequences based on
direct experimental evidence and through inference determined by sequence
homology. Here we show that this annotation can be extended and corrected
for cases where protein structures are available. Specifically, using the
Combinatorial Extension (CE) algorithm for structure comparison, we extend
the protein annotation currently provided by GOA at the European
Bioinformatics Institute (EBI) to further describe the contents of the
Protein Data Bank (PDB). Specific cases of biologically interesting
annotations derived by this method are given. Given that the relationship
between sequence, structure, and function is complicated, we explore the
impact of this relationship on assigning GOA. We consider the effect of
superfolds (folds with many functions) and, by comparison to the Structural
Classification of Proteins (SCOP), the individual effects of family,
superfamily, and fold.
Proteins 2005;58:855–865. © 2005 Wiley-Liss, Inc.
Key words: Gene Ontology annotation; Protein Data Bank; three-dimensional
protein structure; structure homology; sequence homology; structure
comparison; protein annotation
INTRODUCTION
The ongoing process of describing the functional properties and biological
roles of all proteins represents a major task of modern molecular biology.
The evolving Gene Ontology™ (GO),¹,² which standardizes this description, is
vital to this process. GO provides the vocabularies (GO terms) and
relationships in the form of a directed acyclic graph (DAG) for describing
molecular function, biological process, and cellular localization of gene
products from multiple organisms (16,687 terms as of December 8, 2003;
http://www.geneontology.org/). Importantly, GO can be easily interpreted
and used by computers.
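To make the DAG structure concrete, the following minimal Python sketch shows how a term with more than one parent can be represented and how an annotation to a specific term implies annotation to all of its ancestors. The term labels and edges are hypothetical placeholders chosen for illustration; they are not taken from any GO release, and the function names are our own.

```python
from collections import deque

# Toy ontology: child term -> set of parent terms.
# All labels and edges are hypothetical, not real GO terms.
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "hydrolysis": {"catalysis"},
    "metal_binding": {"binding"},
    # Two parents: this is what makes the graph a DAG rather than a tree.
    "metal_dependent_hydrolysis": {"hydrolysis", "metal_binding"},
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    queue = deque(parents[term])
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(parents[current])
    return seen


def expand_annotation(terms, parents=PARENTS):
    """Close a set of annotated terms under the parent relation."""
    closed = set(terms)
    for term in terms:
        closed |= ancestors(term, parents)
    return closed
```

Under this representation, a protein annotated only with the most specific term is implicitly annotated with every more general term above it, which is one reason a machine-readable ontology simplifies comparison of annotations.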
GO is widely used to annotate proteins using data derived from experiments
(microarrays, 2-hybrid screens, etc.), from data already present in
biological databases (usually by means of literature curation) and from data
derived by theoretical approaches (e.g., Jensen et al.,³ Lagreid et al.,⁴
and Letovsky and Kasif⁵). Currently, GO annotation is provided by a number
of single-species oriented databases such as Saccharomyces Genome Database
(SGD),⁶ The Arabidopsis Information Resource (TAIR),⁷ and Mouse Genome
Database (MGD),⁸ as well as multi-species databases such as The Institute
for Genomic Research (TIGR; http://www.tigr.org), Sanger GeneDB
(http://www.genedb.org), and Gene Ontology Annotation (GOA) at the European
Bioinformatics Institute (EBI; http://www.ebi.ac.uk/GOA/).⁹ As of December
13, 2003, TIGR provides GO annotation for 126,556 proteins and GOA EBI for
797,117 proteins; both can be freely downloaded.
Currently, the best annotation of proteins using GO is performed by highly
trained biologists who read the literature and select the appropriate GO
terms to be applied. Since this manual process is time-consuming and
expensive, accurate assignment of GO terms to proteins through automated
extension of manual annotation is important. The commonly used automated
approach is to infer functional similarity by establishing the presence of
sequence homology to existing functionally annotated protein(s). A number of
GO tools have been created that exploit this approach; see, for example,
Goblet¹⁰ and OntoBlast.¹¹ Compugen, Inc. has extended this basic scheme
further by developing the GO Engine, which uses sequence homology, a protein
clustering procedure, and text information.¹² Here we extend the sequence
relationship by adding the relationship between protein structures.
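The general idea of homology-based annotation transfer can be sketched as follows. This is an illustrative sketch only, not the procedure used in this work: the `Hit` record, the `transfer_terms` function, the choice to require both a sequence and a structure criterion, and the cutoff values `min_identity` and `min_struct` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One homolog of the query protein (hypothetical record layout)."""
    target: str          # identifier of an already annotated protein
    seq_identity: float  # fraction of identical aligned residues, 0..1
    struct_score: float  # structure-similarity score, higher is more similar


def transfer_terms(hits, annotations, min_identity=0.4, min_struct=4.5):
    """Collect terms from hits that pass both (assumed) cutoffs.

    Returns a dict mapping each transferred term to the sorted list of
    annotated proteins that support it.
    """
    support = {}
    for hit in hits:
        # Skip homologs that fail either the sequence or the structure test.
        if hit.seq_identity < min_identity or hit.struct_score < min_struct:
            continue
        for term in annotations.get(hit.target, ()):
            support.setdefault(term, []).append(hit.target)
    return {term: sorted(targets) for term, targets in support.items()}
```

Keeping the supporting proteins alongside each transferred term, rather than returning a bare set of terms, makes it possible to trace every inferred annotation back to the evidence that produced it.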
Grant sponsor: National Institutes of Health; Grant number: GM063208.
*Correspondence to: Julia V. Ponomarenko, San Diego Supercomputer Center,
University of California, San Diego, UCSD MC 0537, 9500 Gilman Drive,
La Jolla, CA 92093-0537. E-mail: jpon@sdsc.edu
Received 17 February 2004; Accepted 16 September 2004
Published online 11 January 2005 in Wiley InterScience
(www.interscience.wiley.com). DOI: 10.1002/prot.20355
PROTEINS: Structure, Function, and Bioinformatics 58:855–865 (2005)
© 2005 WILEY-LISS, INC.
The relationship between sequence, structure, and function is complicated,
yet defines whether structure can be