PROGRAM DESCRIPTION

Strategies for Biological Annotation of Mammalian Systems: Implementing Gene Ontologies in Mouse Genome Informatics

David P. Hill,1 Allan P. Davis, Joel E. Richardson, John P. Corradi, Martin Ringwald, Janan T. Eppig, and Judith A. Blake

The Jackson Laboratory, 600 Main Street, Bar Harbor, Maine 04609

Introduction

The nature of biology is informational. Instructions contained in the genetic code are replicated, transcribed, and translated into molecules that carry out the biological functions of the gene products. Throughout time, this information is subjected to evolutionary pressures that lead to diversity and speciation, but also to a conservation of functions, processes, and structures that are biologically useful.

Bioinformatics curates and integrates biological information for computational purposes. One requirement for this is the development of dictionaries of standardized terms containing logical relationships that are consistently used between databases.

The Gene Ontology (GO) project was initiated when three model-organism databases, the Drosophila genome database (FlyBase Consortium, 1999), the Saccharomyces genome database (SGD; Ball et al., 2000), and the mouse genome informatics databases (MGD and GXD, hereafter referred to jointly as MGI; Blake et al., 2000; Ringwald et al., 2000), realized the need for a common annotation language to describe conserved biological principles. This common need resulted in the development of a shared, defined set of terms, so that a query by term recovers the genes or gene products associated with that term across multiple (model organism) species. Curators from each of the model-organism databases contribute to the structure and content of the ontologies, allowing for the cross-species coverage of the resource.
The collection of annotations contributed by each of the model-organism databases is compiled into a common data resource that can be used to explore precise terms across multiple organisms (Ashburner et al., 2000). In addition, each model-organism database has its own unique representation of the GO. Here we detail the strategies used at MGI to curate and use the GO for the annotation of mouse biology.

The Gene Ontologies

The GO is composed of structured vocabularies of terms. Each term is precisely defined so that its meaning is understood by curators. Precisely defined terms are essential to ensure consistent annotation when multiple curators are annotating genes.

The GO is divided into three categories that reflect conserved aspects of biology: molecular function, biological process, and cellular component. The vocabularies contain a structured hierarchy of terms designed to describe what the gene product does and where the gene product is located in the cell. Gene products may be annotated to many terms from each ontology, since a gene may have more than one function, be involved in a variety of processes, or carry out its function in a variety of subcellular locations.

The Molecular Function Ontology describes the fundamental biochemical function of a gene product without regard to its biological context (Fig. 1a). Examples of molecular function include general terms such as “enzyme” and more specific terms such as “mRNA cap binding.” In some cases, such as enzyme, the term names the molecule but is meant to describe its action: enzyme describes the function of catalysis. Enzyme is chosen for the vocabulary because biologists are likely to use the word enzyme to describe the function of the molecule. In some instances, it is simply more convenient to use the noun that carries out the function rather than a complex term to describe the function itself, for example, RNA guanyltransferase.
In any case, annotation to these terms is based on the actual function that the term describes.

The Biological Process Ontology describes a gene product’s global biological objectives (Fig. 1b). Processes may or may not be well defined biochemically, but they have a biological context. Process terms are not meant to describe a pathway. Process terms, like other vocabulary terms, vary in their level of specificity. Some terms, such as “signal transduction,” describe general processes, and others, such as “pyrimidine metabolism,” describe more specific processes. The structured nature of the vocabulary allows for relationships between the different levels of description of a process (see below). For example, in Fig. 1b, the process of splicing is a part of the more general process of transcription.

The Cellular Component Ontology describes the subcellular component where a gene product is found (Fig. 1c). The vocabulary contains not only classical cell biology terms such as “plasma membrane,” but also terms that describe macromolecular structures such as “transcription factor complex.”

1 To whom correspondence should be addressed. Telephone: (207) 288-6430. E-mail: dph@informatics.jax.org.

SPECIAL FEATURE. Genomics 74, 121–128 (2001). doi:10.1006/geno.2001.6513. 0888-7543/01 $35.00. Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved. All articles available online at http://www.idealibrary.com.
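
The vocabulary structure described in this section can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not a description of how MGI or the GO project stores its data. The class and method names (`Vocabulary`, `add_term`, `annotate`, `genes_for`, `terms_for`) and the gene symbols `GeneA`, `GeneB`, and `GeneC` are hypothetical. Retrieving genes through child terms is offered only as an option, since the text above does not specify that behavior. The term names and the "splicing is a part of transcription" relationship are taken from the text.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# The three categories named in the text.
ONTOLOGIES = ("molecular_function", "biological_process", "cellular_component")


@dataclass
class Term:
    name: str
    ontology: str
    # Maps a parent term name to the relationship type, e.g. "part_of".
    parents: dict = field(default_factory=dict)


class Vocabulary:
    """Hypothetical container for terms and gene-to-term annotations."""

    def __init__(self):
        self.terms = {}
        # term name -> set of (species, gene) pairs annotated to it
        self.annotations = defaultdict(set)

    def add_term(self, name, ontology, parents=None):
        if ontology not in ONTOLOGIES:
            raise ValueError(f"unknown ontology: {ontology}")
        parents = dict(parents or {})
        for parent in parents:
            if parent not in self.terms:
                raise KeyError(f"unknown parent term: {parent}")
        self.terms[name] = Term(name, ontology, parents)

    def annotate(self, species, gene, term_name):
        if term_name not in self.terms:
            raise KeyError(f"unknown term: {term_name}")
        self.annotations[term_name].add((species, gene))

    def descendants(self, name):
        """Return every term reachable from `name` by following child links."""
        found, stack = set(), [name]
        while stack:
            current = stack.pop()
            for term in self.terms.values():
                if current in term.parents and term.name not in found:
                    found.add(term.name)
                    stack.append(term.name)
        return found

    def genes_for(self, name, include_descendants=False):
        """Query by term; optionally include genes annotated to child terms."""
        if name not in self.terms:
            raise KeyError(f"unknown term: {name}")
        names = {name}
        if include_descendants:  # assumption: not specified in the text
            names |= self.descendants(name)
        result = set()
        for term_name in names:
            result |= self.annotations.get(term_name, set())
        return result

    def terms_for(self, species, gene):
        """Return all terms a gene is annotated to, across all three ontologies."""
        return {t for t, genes in self.annotations.items() if (species, gene) in genes}


go = Vocabulary()
go.add_term("transcription", "biological_process")
go.add_term("splicing", "biological_process", {"transcription": "part_of"})
go.add_term("mRNA cap binding", "molecular_function")
go.add_term("plasma membrane", "cellular_component")

# Hypothetical gene symbols; one gene may carry several annotations.
go.annotate("mouse", "GeneA", "splicing")
go.annotate("mouse", "GeneA", "mRNA cap binding")
go.annotate("yeast", "GeneB", "splicing")
go.annotate("fly", "GeneC", "transcription")
```

In this sketch, `go.genes_for("splicing")` returns the mouse and yeast genes together, which mirrors the cross-species query by term described in the Introduction, and `go.terms_for("mouse", "GeneA")` shows a single gene product annotated to terms from more than one ontology.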