Semantically Improved Genome-Wide Prediction of Gene Ontology Annotations

Marco Masseroli, Marco Tagliasacchi, Davide Chicco
Dipartimento di Elettronica e Informazione, Politecnico di Milano
Piazza Leonardo da Vinci 32, 20133 Milano, Italy
Email: masseroli@elet.polimi.it, tagliasacchi@elet.polimi.it, davide.chicco@elet.polimi.it

Abstract—Genomic annotations describing the structural and functional features of genes and gene products by means of controlled terminologies and ontologies are extremely valuable, in particular for computational analyses aimed at inferring new biomedical knowledge, which usually rely on available annotations. Yet, they are incomplete, especially for more recently studied genomes, and only some of the available annotations represent highly reliable human curated information. In order to help and speed up the time-consuming curation process and improve the available annotations, computational methods that are able to provide a prioritized list of predicted annotations are hence extremely useful. Starting from a previous work on the automatic prediction of Gene Ontology annotations based on the singular value decomposition (SVD) of the gene-to-term annotation matrix, in this work we propose a novel prediction algorithm that incorporates gene clustering based on gene functional similarity computed by means of the Gene Ontology annotations. We tested the prediction methods performing k-fold cross-validation on the genomes of two organisms, Saccharomyces cerevisiae (SGD) and Drosophila melanogaster (FlyBase). Results demonstrate the effectiveness of our approach.

Index Terms—Annotation prediction; Singular Value Decomposition; gene similarity metrics

I. INTRODUCTION

In molecular biology, new approaches are providing unprecedented amounts of valuable data that foster the increasing relevance of molecular medicine in health care research and practice.
In particular, high-throughput microarray technologies allow thousands of genes and gene products to be studied quickly and simultaneously. At the same time, advancements in information technologies and biomedical informatics are providing tools and techniques to manage the amount of biomedical data produced, as well as many methods for their analysis. In addition, biomedical domain experts are increasingly annotating biomolecular entities, mainly genes and their protein products, with controlled terminologies and ontologies describing their structural, functional and phenotypic biological features.

Currently, several controlled vocabularies are routinely used to annotate genes and proteins. Some of them have a flat structure, i.e. no explicit relationships exist between the terms composing the vocabulary. Others are part of ontologies, where semantic relationships are defined between pairs of terms. The most widely used ontology for annotating biomolecular entities is the Gene Ontology (GO) [1]. It comprises three ontologies that hold a total of nearly 26,000 controlled terms describing species-independent biological process (BP), molecular function (MF) and cellular component (CC) attributes of genes and gene products. Each GO ontology is designed to capture orthogonal aspects of genes and gene products, and it is structured as a directed acyclic graph (DAG) of terms hierarchically related mainly through "is a" or "part of" relationships. An edge exists from a child term a to its parent term b if a "is a" specific instance of b or is "part of" b. Furthermore, each GO DAG has a unique root, defined as the DAG node without parents, and each term can have multiple parents.

Annotation databases contain the biological knowledge that has been gathered over the years, and provide such valuable data as public repositories. Despite their relevance, there are important issues that affect annotation databases [2].
In particular, first, the annotations are not exhaustive: only a subset of genes and gene products of sequenced organisms is known and, among those, only a small fraction has been annotated so far. Furthermore, annotation profiles might be incomplete, because the biological knowledge about the functions associated with a gene or a gene product might be yet to be discovered, or because the evidence already available in the literature has not yet been entered into the database. Second, available annotations might be incorrect, e.g. those inferred from electronic annotations without the involvement of a human curator.

In this context, the contributions of computational tools able to analyze data stored in annotation databases are manifold. For example, it is possible to assess the relevance of inferred annotations, or to produce a ranked list of missing annotations in order to speed up the curation process. Furthermore, since most of the bioinformatics analyses currently performed on genomic and proteomic data rely on the available annotations of genes and gene products, an improvement of such annotations in quantity, coverage and quality is paramount for obtaining better results in these analyses.

A few years ago, King et al. [3] proposed the use of decision trees and Bayesian networks for predicting annotations by learning patterns from available annotation profiles. Recently, Tao et al. [4] proposed to use a k-nearest neighbor (k-NN) classifier, whereby a gene inherits the annotations that