Short Paper
ProSpecTome: a new tagged corpus for protein named entity
recognition
Renata Kabiljo (1,*), Diana Stoycheva (2) and Adrian J Shepherd (1)
(1) School of Crystallography, Birkbeck, University of London, Malet Street, London WC1E 7HX, UK
(2) University of Heidelberg, Institute of Pharmacy and Molecular Biotechnology, Im Neuenheimer Feld 264, D-69120 Heidelberg, Germany
ABSTRACT
Motivation: This work grew out of our ongoing research on protein-
protein interactions, in particular our desire to use one or more existing
protein taggers for highlighting putative proteins in free text (as part of a
larger interaction-mining system). Our primary motivation for developing a
new evaluation corpus (i.e. a corpus designed not to be used for training
purposes) was that we were unable to find an existing evaluation corpus
that would enable us to carry out an independent comparative analysis of
tool performance aligned to our application.
Results: We have produced a new protein-specific corpus – ProSpecTome
– that is designed to facilitate the fair evaluation of protein taggers. It has
been compiled by re-annotating a subset of the MEDLINE abstracts from
the widely-used JNLPBA evaluation corpus. ProSpecTome combines a
number of desirable features that are not shared by any other single corpus:
it explicitly annotates names of proteins, but not non-coding genes; it
incorporates two levels of specificity with regard to the category protein
(with general references to proteins annotated separately from the names of
individual proteins and protein families); the annotation guidelines used to
produce the corpus together with the degree of inter-annotator agreement
associated with its production are explicitly documented; and it is provided
in a convenient XML format (with accompanying stylesheet so that the
corpus can be easily displayed in a web browser).
Availability: The ProSpecTome corpus and associated annotation
guidelines are freely available and can be downloaded from
http://textmining.cryst.bbk.ac.uk/ProSpecTome/.
Contact: r.kabiljo@mail.cryst.bbk.ac.uk
1 INTRODUCTION
Protein Named Entity Recognition (NER) is of vital
importance to a number of biomedical text-mining tasks
such as the extraction of functional annotations and
information about protein-protein interactions from the
literature. There are a number of freely available protein
taggers (i.e. tools that aim to automatically mark up protein
names in natural language texts), and it is clearly desirable
that we should be able to independently and reliably
evaluate the performance of these tools.
Biomedical corpora in which protein names have been
manually annotated play a vital role in the development and
subsequent evaluation of protein taggers. Most protein
taggers have been trained and/or tested using one or more of
the following corpora: GENIA (Kim et al., 2003),
GENETAG (Tanabe et al., 2005) and Yapex (Franzén et al.,
2002). However, it is clear from a comparison of these
corpora that there is considerable disagreement about what
entities should be annotated as proteins. This diversity
partly reflects the inherent complexity of the domain, but
also the range of possible applications these corpora are
designed to support. Moreover, even when two corpora
agree that a given entity is a protein, there is often
disagreement about where the boundaries (i.e. start-point
and end-point) of the protein name are located within the
text. Consequently, the performance of a given tool is likely
to vary considerably depending on which corpus it was
trained on and, crucially, which corpus is used in its
evaluation.
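
The effect of boundary disagreement on evaluation can be illustrated with a minimal sketch. The example sentence, the two annotation conventions, and the function name below are hypothetical and are not taken from any of the corpora discussed; the sketch only shows how the same tagger output can count as correct under a strict (exact-boundary) criterion against one annotation and merely as a partial (overlapping) match against another.

```python
# Illustrative only: spans are (start, end) character offsets, end exclusive.

def classify_match(gold, pred):
    """Return 'exact', 'overlap', or 'none' for a gold and a predicted span."""
    if gold == pred:
        return "exact"
    # Two half-open intervals overlap when each starts before the other ends.
    if pred[0] < gold[1] and gold[0] < pred[1]:
        return "overlap"
    return "none"

text = "the human interleukin-2 receptor was expressed"

# Hypothetical convention A annotates "interleukin-2 receptor".
gold_a = (text.index("interleukin-2"), text.index(" was"))
# Hypothetical convention B includes the modifier: "human interleukin-2 receptor".
gold_b = (text.index("human"), text.index(" was"))

# A hypothetical tagger output that happens to follow convention A.
predicted = gold_a

result_a = classify_match(gold_a, predicted)  # scored against convention A
result_b = classify_match(gold_b, predicted)  # scored against convention B
```

Under strict matching, the prediction would be counted as correct against the first annotation but as an error against the second, even though the tagger has arguably found the same protein in both cases.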
Here we introduce a new corpus – the ProSpecTome
corpus – annotated exclusively with protein names. In
designing ProSpecTome, we have re-annotated a subset of
the JNLPBA (Joint Workshop on Natural Language
Processing in Biomedicine and its Applications) evaluation
corpus (Kim et al., 2004) described below in section 2.1.
We have chosen to re-annotate part of an existing corpus,
rather than start afresh with a new set of texts, for two main
reasons.
Firstly, as its name suggests, the JNLPBA evaluation
corpus has been deliberately “reserved” for evaluation
purposes. Provided the developers of protein taggers
respect this intention, it is reasonable to assume that taggers
will not have been trained on the data in this corpus.
Consequently, this set of texts is a natural choice when the
aim is to develop a corpus for performing a fair evaluation
of multiple taggers, since clearly we cannot perform a fair
evaluation by testing a tool on the same data that was used
to train it.
Secondly, we believe that having two contrasting sets of
annotations for the same set of texts is valuable in its own
right. By comparing the performance of a protein tagger on
both the JNLPBA evaluation corpus and ProSpecTome, we
can quantify effects that are attributable to the choice of
annotation conventions in isolation from those attributable
to differences in the use of language within the texts
themselves.
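
One way to make this comparison concrete is to score a single set of tagger predictions against two annotation sets for the same text. The sketch below is a hypothetical illustration: the span offsets are invented, exact-boundary matching is an assumption, and the function is not part of ProSpecTome or of the JNLPBA evaluation tools.

```python
# Illustrative only: each annotation is a (start, end) character span.

def precision_recall_f1(gold, predicted):
    """Score predicted spans against gold spans using exact-boundary matching."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented spans for one abstract under two annotation conventions.
annotations_a = {(4, 32), (50, 58), (70, 81)}
annotations_b = {(10, 32), (50, 58)}

# Invented output of a single tagger run on the same abstract.
tagger_output = {(10, 32), (50, 58), (90, 95)}

scores_a = precision_recall_f1(annotations_a, tagger_output)
scores_b = precision_recall_f1(annotations_b, tagger_output)
```

Because the text and the tagger output are held fixed, any difference between the two sets of scores is attributable to the annotation conventions alone.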
In designing ProSpecTome, we have aimed to adopt good
practices relevant to the development of biomedical
corpora. Both Mani et al. (2005) and Cohen et al. (2005)
stress the desirability of providing explicit annotation
guidelines and assessments of inter-annotator agreement.
These topics are addressed below in sections 3.2 and 3.3
respectively.
© Oxford University Press 2005