The METER Corpus: A corpus for analysing journalistic text reuse

Robert Gaizauskas†, Jonathan Foster‡, Yorick Wilks†, John Arundel‡, Paul Clough†, Scott Piao†
Departments of Computer Science† and Journalism‡
University of Sheffield, Sheffield, S1 4DP
(contact: R.Gaizauskas@dcs.shef.ac.uk; fax: (0114) 222 1810)

Abstract

As a part of the METER (MEasuring TExt Reuse) project, we have built a new type of comparable corpus consisting of annotated examples of related newspaper texts. Texts in the corpus were manually collected from two main sources: the British Press Association (PA) and nine British national newspapers that subscribe to the PA newswire service. In addition to being structured to support efficient search for related PA and newspaper texts, the corpus is annotated at two levels. First, each of the newspaper texts is assigned one of three coarse, global classifications indicating its derivation relation to the PA: wholly derived, partially derived or non-derived. Second, about 400 wholly or partially derived newspaper articles are annotated down to the lexical level, indicating for each phrase, or even individual word, whether it appears verbatim, rewritten or as new material. We envisage that this corpus will be of use for a variety of studies, including detection and measurement of text reuse, analysis of paraphrase and journalistic styles, and information extraction/retrieval. To illustrate these potential uses, we briefly describe some work we have done with the corpus to develop algorithms for detecting text reuse.

1. Introduction

The aim of the METER (MEasuring TExt Reuse) project¹ is to investigate how text is reused in the production of newspaper articles from newswire sources and to determine whether algorithms can be discovered to detect and quantify such reuse automatically.
We hope that the results will generalise beyond the newspaper-newswire scenario and provide broader insights into the nature of text derivation and paraphrase; but the newspaper-newswire scenario provides an ideal initial case study, and one with considerable potential practical application (see below).

To assist in this study it was necessary to create a comparable corpus² consisting of a selection of newswire texts and newspaper articles reporting the same stories, in some cases derived from the newswire texts and in some cases not. Because the Press Association, the major British domestic newswire service, is a collaborator in the METER project and has provided us with unrestricted access to its newswire service, we have used its archive as the source newswire for our corpus and texts from a variety of its subscribers in the British press as the candidate derived texts.

Having assembled the corpus and annotated it to assist in our study of text reuse, we believed the corpus would be of wider interest to the corpus linguistics, natural language processing and language engineering communities, and hence decided to package and release the corpus on its own. This paper describes the design, structure and contents of the corpus, and illustrates its potential by briefly describing some experiments we have carried out using it. The METER Corpus is available free-of-charge for research purposes.

It should be stressed that the METER corpus is a pioneering corpus for the study of text reuse and, as such, is no doubt flawed and limited in various ways. Resource limitations have restricted both the size of the corpus and the amount of interannotator verification carried out on the annotations. Ideas as to how it should be annotated continued to evolve during the process of annotation, which means that complete consistency across annotations has probably not been achieved.
Our hope is that, despite these limitations, the corpus will still prove useful to others, even if only as a starting point for designing a better resource.

1 For further details of the METER project, see: http://www.dcs.shef.ac.uk/nlp/funded/meter.html.
2 Johansson et al. (1996: 3) define comparable corpus as: “corpora consisting of parallel original and translated texts in the same languages”.
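
To make the two-level annotation outlined in the abstract more concrete, the following is a minimal Python sketch of how one annotated newspaper article might be modelled. It is illustrative only: the class names (`Derivation`, `Reuse`, `Span`, `Article`), the helper `reuse_proportions`, and the example text are all hypothetical and do not reflect the actual file format or markup of the corpus. Only the three article-level classes and the three lexical-level tags are taken from the description above.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Derivation(Enum):
    """Article-level relation to the PA source (the three classes named in the abstract)."""
    WHOLLY_DERIVED = "wholly derived"
    PARTIALLY_DERIVED = "partially derived"
    NON_DERIVED = "non-derived"


class Reuse(Enum):
    """Lexical-level tags (the three tags named in the abstract)."""
    VERBATIM = "verbatim"
    REWRITTEN = "rewritten"
    NEW = "new"


@dataclass
class Span:
    """A phrase or single word carrying one lexical-level tag (hypothetical structure)."""
    text: str
    reuse: Reuse


@dataclass
class Article:
    """A newspaper article with both annotation levels (hypothetical structure)."""
    newspaper: str
    derivation: Derivation
    spans: List[Span] = field(default_factory=list)


def reuse_proportions(article: Article) -> Dict[Reuse, float]:
    """Return the share of whitespace-separated words under each lexical tag."""
    counts: Counter = Counter()
    for span in article.spans:
        counts[span.reuse] += len(span.text.split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: counts[tag] / total for tag in Reuse}


# Invented example text; not taken from the corpus.
example = Article(
    newspaper="(hypothetical newspaper)",
    derivation=Derivation.PARTIALLY_DERIVED,
    spans=[
        Span("A man was arrested", Reuse.VERBATIM),
        Span("late on Tuesday", Reuse.REWRITTEN),
        Span("police said yesterday", Reuse.NEW),
    ],
)

for tag, share in reuse_proportions(example).items():
    print(f"{tag.value}: {share:.2f}")
```

A word-level proportion of this kind is only one simple summary that such annotations would support; it is not presented here as the measure used in the METER project.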