Lecture Notes in Computer Science

Parallel Overlap and Similarity Detection in Semi-Structured Document Collections

Krisztián Monostori, Arkady Zaslavsky, Heinz Schmidt
School of Computer Science and Software Engineering
Monash University, Melbourne, Australia
{krisztian.monostori, arkady.zaslavsky, heinz.schmidt}@infotech.monash.edu.au

Abstract. The proliferation of digital libraries and the high availability of electronic documents on the Internet have created new challenges for computer science researchers and professionals. This paper discusses the use of parallel and cluster computing systems for detecting plagiarism in large collections of semi-structured electronic texts, ranging from software written in formal languages at one end of the spectrum to natural-language texts at the other. The core of the system is based on string matching algorithms and suffix trees. Implementation and performance issues are also discussed.

1. Introduction

Digital libraries provide vast amounts of digitised information on-line. Preventing these documents from unauthorised copying and redistribution is a hard and challenging task, which often results in valuable documents not being put on-line [8]. Copy-prevention mechanisms include distributing information on a separate disk, using special hardware, or using active documents [9]. One of the more recent areas of copy-detection applications is plagiarism detection. With the enormous growth of the information available on the Internet, users have a handy tool for "creating" assignments: with numerous search engines they can easily find relevant articles and papers for their research. The Internet, however, is a two-edged sword. Documents are available in electronic form and lend themselves all too easily to cut-and-paste or drag-and-drop operations. Consequently, without tools, it may be hard to determine the amount of original work.
Several systems have been built for plagiarism detection, including SCAM [9], Glatt [12], and plagiarism.org [17]. SCAM and plagiarism.org are similar in their approach: they build an index over a collection of registered documents using hashing algorithms and compare the hashed values. Our approach computes similarity in a manner similar to [9] and plagiarism.org [17], but uses exact string matching algorithms rather than hashing, which has a finite probability of failure, as reported in [10]. How to identify candidate documents is beyond the scope of this paper; this problem is also addressed in the dSCAM prototype developed at Stanford University and described in [11].
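The hashing-based comparison used by systems such as SCAM and plagiarism.org can be sketched roughly as follows. This is a minimal illustration only, not their actual chunking or hashing scheme; the function names and the word-level k-gram granularity are assumptions made here for clarity. It also shows why such schemes have a finite probability of failure: two different k-grams may hash to the same value, whereas exact string matching never reports a false match.

```python
import hashlib

def kgram_fingerprints(text, k=5):
    """Hash every overlapping k-gram of words in the text.
    (Illustrative only; real systems use more elaborate chunking.)"""
    words = text.lower().split()
    grams = (" ".join(words[i:i + k]) for i in range(len(words) - k + 1))
    return {hashlib.md5(g.encode("utf-8")).hexdigest() for g in grams}

def overlap_score(doc, registered, k=5):
    """Fraction of doc's k-gram fingerprints that also occur in the
    registered document -- a crude resemblance measure."""
    a = kgram_fingerprints(doc, k)
    b = kgram_fingerprints(registered, k)
    return len(a & b) / len(a) if a else 0.0
```

A document compared against itself scores 1.0, and a document sharing no k-grams with the registered copy scores 0.0; partial copying falls in between.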