An Algorithm for Generating Representative Functional Annotations Based on Gene Ontology

In-Yee Lee 1, Jan-Ming Ho 1, Wen-Chang Lin 2

1 Institute of Information Science, Academia Sinica, Nankang, Taipei, Taiwan 115, R.O.C.
2 Institute of Biomedical Sciences, Academia Sinica, Nankang, Taipei, Taiwan 115, R.O.C.
1 {iylee,hoho@iis.sinica.edu.tw}, 2 wenlin@ibms.sinica.edu.tw

Abstract

The authors address the issue of providing highly representative descriptions in automated functional annotations. For an uncharacterized sequence, a common strategy is to infer such annotations from those of its well-characterized homologues. However, under many circumstances, this strategy fails to produce meaningful annotations. Using information revealed by the structured vocabularies of Gene Ontology, we propose a quantitative algorithm to assign representative annotations. We established a confidence function that reflects both the precision and coverage of a candidate annotation, and derived the function's parameters from analyses of significant forms of candidate distributions on the GO graph. We tested the algorithm with our self-designed BIO101 (http://BIO101.iis.sinica.edu.tw), an automated annotation system that supports the workflows of functional annotations for expressed sequence tags (ESTs). According to our experimental results, the algorithm is capable of producing representative and meaningful functional annotations.

1. Introduction

High-throughput methods for genomic sequencing have resulted in large repositories of public domain data [17]. To make knowledge sharing and utilization more efficient, large databases are being established [11] [4] [14], but their data gains value only after annotation tasks are performed, that is, after functional or structural sequence properties are determined. In this paper we propose an automated approach to the labor-intensive job of annotating uncharacterized sequences (denoted here as Seq_u).
Using a sequence similarity algorithm [1], data from different organisms are analyzed to identify a set of well-characterized homologues (S_hom) for each Seq_u. Based on the functional annotations associated with S_hom, it is possible to construct an informative and representative annotation for Seq_u.

Due to the heterogeneity of vocabularies and formats used by various databases, functional information is difficult to manage electronically. For this reason, many designers and researchers are promoting functional annotations based on Gene Ontology (GO) [15] [5] [7] [8] [11] [12] [13] [14] [16] [17] [18]. For instance, the Gene Ontology Consortium [15] is creating sets of domain-specific vocabularies for describing molecular characteristics across various organisms. The GO term hierarchy consists of three ontologies: molecular functions, biological processes, and cellular components. GO terms and their associated “is-a” and “part-of” relations form directed acyclic graphs (DAGs). Since GO provides vocabularies that are not only controlled but also structured, we believe that GO annotations yield information that can assist automated annotation.

In an automated annotation, if the function of a Seq_u is identical to those of the homologues in S_hom, then the annotations inferred from S_hom will be meaningful. However, there are many cases where this is not true. One example is Rosetta stone proteins [10], which have homologues associated with two different proteins, yet are fused into a single polypeptide chain. Such proteins are better annotated using functions common to both homologues, rather than any specific annotation inferred from a single similar homologue within S_hom, even if that homologue is highly similar to Seq_u. In this case, the structured GO graph is ideal for identifying more representative annotations. The set of GO terms associated with the homologues in a specific S_hom (denoted S_Termh) forms a unique distribution on the DAG.
Information revealed by such distributions can be analyzed to determine which terms (candidates) best represent the functional properties of a Seq_u. In a GO DAG, a parent node describes a function exhibited by all of its child nodes. Terms that are lower in height (i.e., closer to the root) describe more general functions; the greater the height, the more specific the function.
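
As an illustration of why the DAG structure helps, the following minimal Python sketch finds the most specific terms shared by a set of homologue annotations. It is not the confidence function proposed in this paper. The term names and edges in `PARENTS` are a hypothetical toy fragment, not the real GO graph, and measuring height as the shortest distance from the root is an assumption made here for simplicity.

```python
from functools import lru_cache

# Hypothetical toy DAG (NOT the real GO graph): each term maps to its parents.
# "kinase" has two parents, which is what makes this a DAG rather than a tree.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "nucleotide_binding": ["binding"],
    "atp_binding": ["nucleotide_binding"],
    "gtp_binding": ["nucleotide_binding"],
    "kinase": ["catalysis", "atp_binding"],
}


def ancestors(term):
    """Return the term itself plus every term reachable via parent edges."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS[node])
    return seen


@lru_cache(maxsize=None)
def height(term):
    """Distance from the root; assumed here to be the shortest path."""
    parents = PARENTS[term]
    if not parents:
        return 0
    return 1 + min(height(p) for p in parents)


def most_specific_common_terms(terms):
    """Terms implied by every input term, keeping only the greatest height."""
    shared = set.intersection(*(ancestors(t) for t in terms))
    top = max(height(t) for t in shared)
    return sorted(t for t in shared if height(t) == top)


# Two homologues annotated with different specific terms share a more
# general function that describes both of them.
print(most_specific_common_terms(["atp_binding", "gtp_binding"]))
```

Because a parent term describes a function exhibited by all of its children, any term in the intersection of the ancestor sets holds for every homologue. Choosing the shared term of greatest height keeps the annotation as specific as the evidence allows.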