A Tool for Text Comparison

Scott S.L. Piao and Tony McEnery
Lancaster University

Abstract

Text reuse is commonplace in academia and the media. An efficient algorithm for automatically detecting and measuring similar or related texts would have applications in corpus linguistics, historical studies and natural language engineering. To explore the issue of text reuse, we have developed a tool named Crouch¹, based on the TESAS system (Piao 2001), for comparing and measuring text similarity and derivation in sets of texts. Given a set of candidate source and derived texts, the tool maps related sentences between a pair of texts using n-gram, stemming and synonym matching. Crouch examines the textual similarity of individual pairs of texts, and also clusters pairs of texts in a collection according to their similarity. The comparison is directional, in that comparing a derived text against its source generally produces a higher score than the comparison in the opposite direction. This opens up the possibility of detecting the direction of text derivation. The tool displays its comparison of a given pair of texts in a graphical interface to help users analyse the texts. Furthermore, as the tool is written in Java and fully supports Unicode, it can be applied to many languages. At Lancaster University, it is currently being used to help detect related articles in 17th century English newspapers.

1. Introduction

It is common practice in the publishing world to reuse text in producing other texts. This practice has grown increasingly common as large amounts of electronic text have become readily accessible to anyone with a networked computer. While text copying and reuse might be regarded by some as a form of plagiarism, it is often a completely legal practice.
For example, in the media industry, journalists subscribe to newswire services, such as the UK Press Association (PA), and quite legitimately reuse or modify texts released by these services when writing newspaper reports (Gaizauskas et al. 2001, Clough et al. 2002a). Today, on the Internet, large numbers of texts carrying similar or related content are produced every day, some of them by reusing other texts. Newspaper texts, in particular, provide interesting material for observing and gaining insight into the practice of text reuse. An efficient algorithm for automatically detecting such texts and measuring the relations between them can be useful for both academic research and practical language engineering tasks.

Yet text reuse in English newspapers is as old as the newspaper industry itself. Our goal in developing Crouch was to explore text reuse in early English newspapers², called newsbooks, produced in the English Commonwealth³. In this paper we describe a tool, named Crouch, developed for this purpose at Lancaster University. Crouch is based on a text-comparison tool, TESAS, which was developed in the METER Project at Sheffield University to identify British newspaper articles reusing texts released by the PA (Piao 2001). Written in Java, Crouch fully supports Unicode and can potentially be used on many languages.

2. Related works

A number of studies have explored issues related to text reuse. For example, Parker and Hamblen (1989) tested several algorithms for detecting student plagiarism in programming assignments. Manber (1994) described a tool, called sif, which, given a query text, can find similar texts in a large collection of texts. Brin et al. (1995) designed a system based on sentence overlaps, named COPS, which can detect copies or partial copies of chunks across texts. Shivakumar et al. (1995) suggested a scheme for text reuse identification, named SCAM, based on word occurrence frequencies.
Their similarity metric reflects both the relative frequencies of words and the overlap between subsets of the texts.

¹ The package is named after John Crouch, an early English satirist, Royalist and newsbook publisher in the English Commonwealth.
² The work outlined in this paper was supported by the British Academy, grant reference SG-33825.
³ While these are not the earliest English newsbooks, as newsbooks of sorts appeared in the reign of Henry VIII, they do come from a period in which the newsbook had become a relatively popular and stable genre of writing. See Cranfield (1978) for an excellent history of the early English newsbooks.

Another