GOMIT: A Generic and Adaptive Annotation Algorithm Based on Gene Ontology Term Distributions

In-Yee Lee 1,2, Jan-Ming Ho 2, Ming-Syan Chen 1
1 Department of Electrical Engineering, National Taiwan University, Taiwan
2 Institute of Information Science, Academia Sinica, Taiwan
E-mail: iylee@iis.sinica.edu.tw

Abstract

We address the issue of providing highly informative annotations using information revealed by the structured vocabularies of Gene Ontology (GO). For a target, a set of candidate terms used to infer the target's properties is collected and forms a unique distribution on the GO directed acyclic graph (DAG). We propose a generic and adaptive algorithm, GOMIT, which uses term distributions and the hierarchical characteristics of GO to assign correct annotations to a target. We establish a quantitative model with parameters that can be trained for optimal performance in different applications. We propose several criteria for evaluating GOMIT's performance and conduct three experiments involving a) automated functional annotations, b) biological annotations of microarray data clusters, and c) protein family GO assignments. In these experiments, we use our proposed criteria to compare GOMIT with other algorithms. The results not only reflect GOMIT's generality and adaptability but also suggest that GOMIT performs better than, or comparably to, other methods in assigning correct annotations.

1. Introduction

High-throughput genomic sequencing methods have been used to create large repositories of public domain data. To improve knowledge sharing efficiency, large databases are being populated with data that gain value only after annotation tasks are performed. Annotating uncharacterized sequences requires either human effort or automated methods [6]. Both human and machine-generated sequence annotations are difficult to manage electronically.
Many designers and researchers are therefore promoting the idea of using ontology-based annotations, since ontologies provide controlled or structured vocabularies that allow for standardized annotation mechanisms. The Gene Ontology Consortium [14] maintains a Gene Ontology (GO) consisting of sets of domain-specific vocabularies for describing molecular characteristics across several organisms. The three ontologies of the GO term hierarchy are molecular functions, biological processes, and cellular components. GO terms and their associated "is-a" and "part-of" relations form directed acyclic graphs (DAGs) in which a parent node describes functions exhibited by its child nodes. Terms that are lower in height (i.e., closer to the root) describe more general functions; the greater the height, the more specific the function.

The purpose of this research can be explained in terms of a well-known application. In an automated functional annotation of an uncharacterized sequence $s$, one can use homologue identification (e.g., sequence similarity) algorithms to derive a set $H_s = \{\, h \mid h \text{ is a homologue of } s \,\}$. Each homologue $h \in H_s$ is associated with a set $\mathit{Term}(h)$ of GO terms that describe the functional properties of $h$. Let $T_s = \bigcup \{\, \mathit{Term}(h) \mid h \in H_s \,\}$. Based on GO characteristics, not only the terms in $T_s$ but also the terms inferred from $T_s$ are meaningful for annotating $s$. Therefore, the structured GO graph is ideal for identifying more informative annotations. On the GO graph, $T_s$ and its ancestor terms form a unique distribution on the DAG. The distribution can be analyzed to determine a set $M_s$ of GO terms that best represent the properties of $s$. We refer to $M_s$ as the most informative terms. This eliminates the need to present the entire set $T_s$ to biologists; presenting the entire set increases the potential for confusion and makes it harder for biologists to determine which GO term within $T_s$ is the most significant, especially when $|T_s|$ is large.
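To make the construction of the candidate set T_s and its distribution on the DAG concrete, the following is a minimal Python sketch under stated assumptions. The toy graph `PARENTS`, the homologue annotations `TERM`, the term identifiers, and the helper names are all hypothetical and are not taken from the GO database or from GOMIT itself. In particular, counting how many homologues support each term or ancestor is only one plausible way to summarize the distribution; it is not the quantitative model proposed in this paper.

```python
from collections import Counter

# Hypothetical toy GO fragment: each term maps to its parent terms
# (edges stand in for "is-a" / "part-of" relations).
PARENTS = {
    "GO:root": [],
    "GO:A": ["GO:root"],
    "GO:B": ["GO:root"],
    "GO:C": ["GO:A"],
    "GO:D": ["GO:A", "GO:B"],
}

# Hypothetical annotations: Term(h) for each homologue h in H_s.
TERM = {
    "h1": {"GO:C"},
    "h2": {"GO:C", "GO:D"},
    "h3": {"GO:D"},
}


def candidate_terms(homologues, term_of):
    """Return T_s, the union of Term(h) over all homologues h in H_s."""
    t_s = set()
    for h in homologues:
        t_s |= term_of[h]
    return t_s


def ancestors(term, parents):
    """Return every term reachable from `term` via parent edges (term excluded)."""
    seen = set()
    stack = list(parents[term])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


def term_distribution(homologues, term_of, parents):
    """Count, for each term in T_s or among its ancestors, how many
    homologues are annotated with that term or one of its descendants.
    (Assumed summary statistic, for illustration only.)"""
    counts = Counter()
    for h in homologues:
        covered = set()
        for t in term_of[h]:
            covered.add(t)
            covered |= ancestors(t, parents)
        counts.update(covered)  # each homologue counts once per term
    return counts


if __name__ == "__main__":
    h_s = ["h1", "h2", "h3"]
    print(sorted(candidate_terms(h_s, TERM)))
    print(sorted(term_distribution(h_s, TERM, PARENTS).items()))
```

In this toy example, T_s contains only GO:C and GO:D, yet the distribution also covers GO:A, GO:B, and the root: GO:A and the root are supported by all three homologues, while GO:B, GO:C, and GO:D are each supported by two. This illustrates why terms inferred from T_s, and not only the terms in T_s, can carry information about s.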
Using the distribution of $T_s$ on the GO DAG, the GOMIT algorithm identifies a set $M_s$ of terms from $T_s$