Short Paper

ProSpecTome: a new tagged corpus for protein named entity recognition

Renata Kabiljo 1,*, Diana Stoycheva 2 and Adrian J Shepherd 1

1 School of Crystallography, Birkbeck, University of London, Malet Street, London WC1E 7HX, UK
2 University of Heidelberg, Institute of Pharmacy and Molecular Biotechnology, Im Neuenheimer Feld 264, D-69120 Heidelberg, Germany

ABSTRACT

Motivation: This work grew out of our ongoing research on protein-protein interactions, in particular our desire to use one or more existing protein taggers for highlighting putative proteins in free text (as part of a larger interaction-mining system). Our primary motivation for developing a new evaluation corpus (i.e. a corpus designed not to be used for training purposes) was that we were unable to find an existing evaluation corpus that would enable us to carry out an independent comparative analysis of tool performance aligned to our application.

Results: We have produced a new protein-specific corpus – ProSpecTome – that is designed to facilitate the fair evaluation of protein taggers. It has been compiled by re-annotating a subset of the MEDLINE abstracts from the widely-used JNLPBA evaluation corpus. ProSpecTome combines a number of desirable features that are not shared by any other single corpus: it explicitly annotates names of proteins, but not non-coding genes; it incorporates two levels of specificity with regard to the category protein (with general references to proteins annotated separately from the names of individual proteins and protein families); the annotation guidelines used to produce the corpus together with the degree of inter-annotator agreement associated with its production are explicitly documented; and it is provided in a convenient XML format (with accompanying stylesheet so that the corpus can be easily displayed in a web browser).

Availability: The ProSpecTome corpus and associated annotation guidelines are freely available and can be downloaded from http://textmining.cryst.bbk.ac.uk/ProSpecTome/.

Contact: r.kabiljo@mail.cryst.bbk.ac.uk

1 INTRODUCTION

Protein Named Entity Recognition (NER) is of vital importance to a number of biomedical text-mining tasks such as the extraction of functional annotations and information about protein-protein interactions from the literature. There are a number of freely available protein taggers (i.e. tools that aim to automatically mark up protein names in natural language texts), and it is clearly desirable that we should be able to independently and reliably evaluate the performance of these tools.

Biomedical corpora in which protein names have been manually annotated play a vital role in the development and subsequent evaluation of protein taggers. Most protein taggers have been trained and/or tested using one or more of the following corpora: GENIA (Kim et al., 2003), GENETAG (Tanabe et al., 2005) and Yapex (Franzén et al., 2002). However, it is clear from a comparison of these corpora that there is considerable disagreement about what entities should be annotated as proteins. This diversity partly reflects the inherent complexity of the domain, but also the range of possible applications these corpora are designed to support. Moreover, even when two corpora agree that a given entity is a protein, there is often disagreement about where the boundaries (i.e. start-point and end-point) of the protein name are located within the text. Consequently, the performance of a given tool is likely to vary considerably depending on which corpus it was trained on and, crucially, which corpus is used in its evaluation.

Here we introduce a new corpus – the ProSpecTome corpus – annotated exclusively with protein names.
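
To make the boundary issue discussed above concrete, the following Python fragment is a minimal sketch of how the same tagger output can score very differently depending on the annotation convention and the matching criterion. The example sentence, the two sets of gold-standard annotations, and the names `score`, `exact`, `overlap` and `span_of` are all assumptions introduced here for illustration; they are not taken from ProSpecTome or from any of the corpora cited.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets; end is exclusive


def exact(a: Span, b: Span) -> bool:
    """Strict criterion: both boundaries must coincide."""
    return a == b


def overlap(a: Span, b: Span) -> bool:
    """Relaxed criterion: the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def score(gold: List[Span], predicted: List[Span],
          match: Callable[[Span, Span], bool]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under the given matching criterion."""
    hit_pred = sum(1 for p in predicted if any(match(p, g) for g in gold))
    hit_gold = sum(1 for g in gold if any(match(g, p) for p in predicted))
    precision = hit_pred / len(predicted) if predicted else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical sentence and annotations, invented for this example.
TEXT = "Expression of the human IL-2 receptor is induced by NF-kappa B."


def span_of(phrase: str) -> Span:
    """Locate a phrase in TEXT and return its character span."""
    start = TEXT.index(phrase)
    return (start, start + len(phrase))


# Two hypothetical conventions that disagree on the left boundary of one name.
gold_narrow = [span_of("IL-2 receptor"), span_of("NF-kappa B")]
gold_wide = [span_of("human IL-2 receptor"), span_of("NF-kappa B")]
predicted = [span_of("IL-2 receptor"), span_of("NF-kappa B")]

print(score(gold_narrow, predicted, exact))    # (1.0, 1.0, 1.0)
print(score(gold_wide, predicted, exact))      # (0.5, 0.5, 0.5)
print(score(gold_wide, predicted, overlap))    # (1.0, 1.0, 1.0)
```

In this toy case the tagger output is identical in all three runs; only the placement of one gold-standard boundary and the choice between strict and overlap matching change, yet the F-score moves between 0.5 and 1.0.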

In designing ProSpecTome, we have re-annotated a subset of the JNLPBA (Joint Workshop on Natural Language Processing in Biomedicine and its Applications) evaluation corpus (Kim et al., 2004) described below in section 2.1. We have chosen to re-annotate part of an existing corpus, rather than start afresh with a new set of texts, for two main reasons.

Firstly, as its name suggests, the JNLPBA evaluation corpus has been deliberately “reserved” for evaluation purposes. Provided the developers of protein taggers respect this intention, it is reasonable to assume that taggers will not have been trained on the data in this corpus. Consequently, this set of texts is a natural choice when the aim is to develop a corpus for performing a fair evaluation of multiple taggers, since a tool cannot be evaluated fairly on the same data that was used to train it.

Secondly, we believe that having two contrasting sets of annotations for the same set of texts is valuable in its own right. By comparing the performance of a protein tagger on both the JNLPBA evaluation corpus and ProSpecTome, we can quantify effects that are attributable to the choice of annotation conventions in isolation from those attributable to differences in the use of language within the texts themselves.

In designing ProSpecTome, we have aimed to adopt good practices relevant to the development of biomedical corpora. Both Mani et al. (2005) and Cohen et al. (2005) stress the desirability of providing explicit annotation guidelines and assessments of inter-annotator agreement. These topics are addressed below in sections 3.2 and 3.3 respectively.

© Oxford University Press 2005