Assigning New GO Annotations to Protein Data Bank Sequences by Combining Structure and Sequence Homology

Julia V. Ponomarenko,¹,²* Philip E. Bourne,¹,³ and Ilya N. Shindyalov¹

¹San Diego Supercomputer Center, University of California, San Diego, La Jolla, California
²Institute of Cytology and Genetics, SB RAS, Novosibirsk, Russia
³Department of Pharmacology, University of California, San Diego, La Jolla, California

ABSTRACT

Accompanying the discovery of an increasing number of proteins, there is the need to provide functional annotation that is both highly accurate and consistent. The Gene Ontology™ (GO) provides consistent annotation in a computer-readable and usable form; hence, GO annotation (GOA) has been assigned to a large number of protein sequences based on direct experimental evidence and through inference determined by sequence homology. Here we show that this annotation can be extended and corrected for cases where protein structures are available. Specifically, using the Combinatorial Extension (CE) algorithm for structure comparison, we extend the protein annotation currently provided by GOA at the European Bioinformatics Institute (EBI) to further describe the contents of the Protein Data Bank (PDB). Specific cases of biologically interesting annotations derived by this method are given. Given that the relationship between sequence, structure, and function is complicated, we explore the impact of this relationship on assigning GOA. The effect of superfolds (folds with many functions) is considered and, by comparison to the Structural Classification of Proteins (SCOP), the individual effects of family, superfamily, and fold. Proteins 2005;58:855–865. © 2005 Wiley-Liss, Inc.
Key words: Gene Ontology annotation; Protein Data Bank; three-dimensional protein structure; structure homology; sequence homology; structure comparison; protein annotation

INTRODUCTION

The ongoing process of describing the functional properties and biological roles of all proteins represents a major task of modern molecular biology. The evolving Gene Ontology™ (GO),¹,² which standardizes this description, is vital to this process. GO provides the vocabularies (GO terms) and relationships in the form of a directed acyclic graph (DAG) for describing molecular function, biological process, and cellular localization of gene products from multiple organisms (16,687 terms as of December 8, 2003; http://www.geneontology.org/). Importantly, GO can be easily interpreted and used by computers.

GO is widely used to annotate proteins using data derived from experiments (microarrays, 2-hybrid screens, etc.), from data already present in biological databases (usually by means of literature curation), and from data derived by theoretical approaches (e.g., Jensen et al.,³ Lagreid et al.,⁴ and Letovsky and Kasif⁵). Currently, GO annotation is provided by a number of single-species oriented databases such as Saccharomyces Genome Database (SGD),⁶ The Arabidopsis Information Resource (TAIR),⁷ and Mouse Genome Database (MGD),⁸ as well as multi-species databases such as The Institute for Genomic Research (TIGR; http://www.tigr.org), Sanger GeneDB (http://www.genedb.org), and Gene Ontology Annotation (GOA) at the European Bioinformatics Institute (EBI; http://www.ebi.ac.uk/GOA/).⁹ As of December 13, 2003, TIGR provides GO annotation for 126,556 proteins and GOA EBI for 797,117 proteins; both can be freely downloaded.

Currently, the best annotation of proteins using GO is performed by highly trained biologists who read the literature and select the appropriate GO terms to be applied.
Since this manual process is time-consuming and expensive, the accurate assignment of GO terms to proteins through automated extension of manual annotation is of significance. The commonly used automated approach is to infer functional similarity by establishing the presence of sequence homology to existing functionally annotated protein(s). A number of GO tools have been created that exploit this approach; see, for example, Goblet¹⁰ and OntoBlast.¹¹ Compugen, Inc. has extended this basic scheme further by developing the GO Engine, which uses sequence homology, a protein clustering procedure, and text information.¹² Here we extend the sequence relationship by adding the relationship between protein structures.

Grant sponsor: National Institutes of Health; Grant number: GM063208.
*Correspondence to: Julia V. Ponomarenko, San Diego Supercomputer Center, University of California, San Diego, UCSD MC 0537, 9500 Gilman Drive, La Jolla, CA 92093-0537. E-mail: jpon@sdsc.edu
Received 17 February 2004; Accepted 16 September 2004
Published online 11 January 2005 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/prot.20355
PROTEINS: Structure, Function, and Bioinformatics 58:855–865 (2005)
© 2005 WILEY-LISS, INC.

The relationship between sequence, structure, and function is complicated, yet defines whether structure can be