False Annotations of Proteins: Automatic Detection via Keyword-Based Clustering

Noam Kaplan* and Michal Linial

Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem 91904, Israel

* To whom correspondence should be addressed

ABSTRACT

Computational protein annotation methods occasionally introduce errors. False-positive (FP) errors are annotations that are mistakenly associated with a protein. Such false annotations introduce errors that may spread into databases through similarity with other proteins. We present a protein-clustering method that enables automatic separation of FP from true-positive hits. The method is based on the combination of each protein's annotations. Using a test set of all PROSITE signatures that are marked as FPs, we show that the method successfully separates FPs in 70% of the cases. Automatic detection of FPs may greatly facilitate the manual validation process and increase annotation sensitivity.

Contact: kaplann@cc.huji.ac.il; michall@cc.huji.ac.il

INTRODUCTION

Computational protein annotation methods are widely used. A wide variety of annotation methods exists, many of which rely on some kind of scoring. Typically, when testing whether a protein should be given a certain annotation, a score threshold is set, and proteins that score higher than the threshold are given the annotation. Obviously, some annotation mistakes may occur. Such mistakes can be divided into false positives (FPs) and false negatives (FNs). FPs (or false hits) are annotations that were mistakenly assigned to a protein (type I error). FNs (or misses) are annotations that should have been assigned to a protein but were not (type II error). Adjusting the score threshold allows a tradeoff between these two types of mistakes. FP annotations are considered to be of graver consequence than FNs.
This is partly because introducing a false-positive annotation into a protein database may cause other proteins to become incorrectly annotated on the basis of sequence similarity (Linial 2003; Gilks et al. 2002). A systematic evaluation of the sources of false annotations that have already contaminated current databases has been reported (Iliopoulos et al. 2003). Several automatic systems, such as PEDANT (Frishman et al. 2003) and GeneQuiz (Andrade et al. 1999), were introduced with the goal of matching the performance of human experts. Still, overinterpretation, FN errors, typographic mistakes and the domain-based transitivity pitfall limit the use of such fully automatic systems for inferring protein function.

Because it is important to minimize the number of false annotations and to maintain highly reliable protein databases, three methods are generally used to avoid false annotations. The first is manual validation of the annotation of each protein, which creates a serious bottleneck in the addition of new proteins and annotations to the database. The second is the use of high score thresholds, which lowers the rate of FPs but also increases the rate of FNs. The third is requiring hits from several different detection methods, which eliminates the advantages that are unique to individual methods. Thus, it would be beneficial to develop means by which FP annotations could be detected automatically.

Here we present such a method, which uses clustering of protein functional groups to separate true positives (TPs) from FPs automatically. Our method is based on the following notions: (a) protein annotations represent biological properties; (b) protein functional groups share specific combinations of biological properties, essentially constituting "property clusters"; (c) if two proteins have very different combinations of annotations, they are unlikely to share a single functional annotation (that is, there is a high chance that one of them was given this annotation incorrectly).
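Notion (c) can be illustrated with a minimal sketch. The similarity measure used here is the Jaccard index over keyword sets, which is an assumption chosen for illustration and not necessarily the measure used by the method. The protein names and keywords are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index of two keyword sets (an assumed stand-in measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical proteins that all received the same signature hit,
# each described by its combination of annotation keywords.
hits = {
    "kinase_A": {"ATP-binding", "Transferase", "Kinase", "Phosphorylation"},
    "kinase_B": {"ATP-binding", "Transferase", "Kinase"},
    "kinase_C": {"ATP-binding", "Kinase", "Phosphorylation"},
    "odd_one": {"Membrane", "Transport", "Ion channel"},
}


def mean_similarity(proteins: dict) -> dict:
    """Average similarity of each protein to every other protein in the group."""
    result = {}
    for name, keywords in proteins.items():
        others = [jaccard(keywords, kw) for n, kw in proteins.items() if n != name]
        result[name] = sum(others) / len(others)
    return result


scores = mean_similarity(hits)

# The protein whose annotation combination is least like the rest
# is the most likely false-positive candidate under notion (c).
suspect = min(scores, key=scores.get)
print(suspect, scores)
```

In this toy group the three kinase-like proteins share most of their keywords, while the fourth shares none, so it is singled out as the candidate false hit. A real application would cluster the group instead of picking a single outlier.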
These notions are not obvious, but they were shown to correctly indicate false annotations in some individual cases tested manually using the graphical annotation-analysis tool of PANDORA (Kaplan et al. 2003). Still, they have not been tested on a wide scale, nor applied as automatic methods. Using these ideas, the method attempts to separate a group of proteins into "property clusters" by introducing a measure that quantifies the similarity between the annotation combinations of two proteins. According to our basic notions, these clusters are likely to correspond to false and true hits.

We tested our method on the PROSITE protein signature database (Sigrist et al. 2002). The database consists of 1189 protein signatures (essentially annotations) that were assigned to a protein database. PROSITE annotations of proteins are manually validated, with each protein hit marked as either a TP or an FP. Out of this set of 1189 signatures, we