Plagiarism Detection without Reference Collections

Sven Meyer zu Eissen, Benno Stein, and Marion Kulig

Faculty of Media, Media Systems
Bauhaus University Weimar, 99421 Weimar, Germany
sven.meyer-zu-eissen@medien.uni-weimar.de
benno.stein@medien.uni-weimar.de

Decker and Lenz (Eds.): Advances in Data Analysis. Selected Papers from the 30th Annual Conference of the German Classification Society (GfKl), Berlin, ISBN 978-3-540-70980-0, pp. 359-366, © Springer 2007.

Abstract. Current research in automatic plagiarism detection for text documents focuses on algorithms that compare suspicious documents against potential original documents. Although recent approaches perform well in identifying copied or even modified passages [Brin 1995, Stein 2005], they assume a closed world in which a reference collection must be given [Finkel 2002]. A human reader, however, can identify suspicious passages within a document without having a library of potential original documents in mind. This raises the question of whether plagiarized passages within a document can be detected automatically if no reference is given, e.g. if the plagiarized passages stem from a book that is not available in digital form. This paper addresses that question: it proposes a method to identify potentially plagiarized passages by analyzing a single document with respect to changes in writing style. Such passages can then serve as a starting point for an Internet search for potential sources, or they can be preselected for inspection by a human referee. Among other things, we present new style features that can be computed efficiently and that provide highly discriminative information. Our experiments, which are based on a test corpus that will be published, show encouraging results.

1 Introduction

A recent large-scale study of 18,000 students by McCabe reveals that about 50% of the students admit to having plagiarized from extraneous documents [10].
Plagiarism in text documents takes several forms: one-to-one copies, passages that are modified to a greater or lesser extent, and even translated passages. Figure 1, which is taken from [15], shows a taxonomy of plagiarism delicts along with possible detection methods.

1.1 Some Background on Plagiarism Detection

The success of current approaches to plagiarism detection varies according to the underlying plagiarism delict. The approaches stated in [1,6] employ