Duplication in Corpora

Nadjet Bouayad-Agha and Adam Kilgarriff
Information Technology Research Institute
University of Brighton
Lewes Road, Brighton BN2 4GJ, UK
email: first-name.last-name@itri.bton.ac.uk

We investigate duplication, a pervasive problem in NLP corpora. We present a method for finding it that uses word frequency list comparisons, and we experiment with this method on different units of duplication.

1. Introduction

Most corpora contain repeated material. In sampled corpora such as the Brown Corpus, duplication is less of an issue: the linguistic data is carefully selected in proportion by genre, which reduces the risk of introducing unwanted duplication. However, the typical corpus used in NLP is one in which as much data as possible of the desired genre is gathered. The result is a corpus whose nature and content are largely unknown. To our knowledge, this issue has not previously been discussed in the literature.

While we may expect the repeated occurrence of words or expressions to reflect their use in the language, the repetition of longer stretches of printed material (of section, paragraph or even sentence length) most likely does not. Text-processing technology allows writers to cut and paste any length of text. Text duplication arises for many reasons: the newspaper that reproduces an article from a weekday edition in the weekend edition, the famous quote that gets cited in every paper of a research community, the warning message that appears at the top of every instruction manual, and so on. This is all valid corpus data. However, data duplication can be critical for corpus statistics.

We present a method for finding duplicated material and evaluate it against a corpus of Patient Information Leaflets (PILs). PILs are the inserts that accompany medicines; they contain information about how to take the medicine, the ingredients, contra-indications, side-effects, etc.
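The idea of comparing word frequency lists to detect duplicated units can be illustrated with a minimal sketch. The comparison statistic below (a symmetric token-overlap ratio between two frequency lists) is an assumption for illustration only; the paper's own method is presented in section 3 and may use a different measure.

```python
from collections import Counter
import re

def freq_list(text):
    """Word frequency list for a text unit (lowercased alphabetic tokens)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def overlap(fa, fb):
    """Proportion of tokens shared between two frequency lists.

    A simple symmetric overlap measure (illustrative only): twice the
    shared token mass divided by the combined size of both units.
    Identical units score 1.0; units with no words in common score 0.0.
    """
    shared = sum((fa & fb).values())  # multiset intersection of counts
    total = sum(fa.values()) + sum(fb.values())
    return 2 * shared / total if total else 0.0

# Hypothetical leaflet sentences, not taken from the PILs corpus:
a = freq_list("Do not exceed the stated dose. Keep out of reach of children.")
b = freq_list("Do not exceed the stated dose. Store in a cool dry place.")
c = freq_list("Shake the bottle well before use.")

print(overlap(a, a))  # identical units score 1.0
print(overlap(a, b))  # near-duplicates score high
print(overlap(a, c))  # unrelated units score low
```

A pair of units whose overlap exceeds some threshold would then be flagged as candidate duplicates for inspection; the appropriate unit size (leaflet, section, paragraph or sentence) is exactly what the experiments in section 4 vary.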
The corpus was compiled for a text generation project whose aim was to generate PILs in multiple languages. This means that some of the duplicated material might be reused as canned text in our generator. The PILs corpus is presented in the next section, with a brief evaluation of its duplication and of the problems it can pose. Section 3 presents a method for finding duplication using word frequency lists, and section 4 reports on experiments looking for duplication in the PILs corpus.

2. The Corpus

The source of the corpus is the ABPI (Association of the British Pharmaceutical Industry) (1997) Compendium of PILs. It consists of 546 leaflets (650,000 words) organised by company. There are