A Corpus for Analyzing Text Reuse by People of Different Groups
Notebook for PAN at CLEF 2015

Waqas Arshad Cheema, Fahad Najib, Shakil Ahmed, Syed Husnain Bukhari, Abdul Sittar, and Rao Muhammad Adeel Nawab

Department of Computer Science, COMSATS Institute of Information Technology, Lahore, Pakistan.
waqascheema06@gmail.com, choudharyfahad@gmail.com, shakil.ahmed@ciitlahore.edu.pk, husnain.syed@live.com, abdulsittar72@gmail.com, adeelnawab@ciitlahore.edu.pk

Abstract
Plagiarism, the unattributed reuse of text, is a significant problem, particularly for higher education institutions. Consequently, a number of automated plagiarism detection systems have been developed to address it. Comparing these systems is difficult because real cases of plagiarism by students / scholars are hard to collect. This paper describes the development of a corpus containing simulated cases of plagiarism written by people with different levels of writing skill. The corpus will be a valuable addition to the set of evaluation resources currently available for comparing plagiarism detection systems.

1 Introduction

The unacknowledged reuse of information is generally known as plagiarism [9]. Plagiarism is acknowledged as a significant and increasing problem in higher education [7] [11] [20] [12] [5]. As a result, plagiarism and its detection have recently received much attention [1] [8] [21], and higher education institutions now use automated systems to detect plagiarism in the work of students / scholars. Numerous approaches to plagiarism detection are available [2] [19]. However, one of the barriers preventing a comparison among techniques is the lack of a standardised evaluation resource. This corpus will be a valuable addition to the set of existing corpora for the plagiarism detection task.
This corpus (1) can be used to compare and evaluate different plagiarism detection techniques, (2) will support further research in the field, and (3) will help in understanding the strategies that students / scholars use when they plagiarise.

The aim of this corpus collection is to investigate how text is reused by students / scholars while writing an article, and to determine whether algorithms can be discovered to detect and quantify such reuse automatically. It is hoped that the results will generalise beyond text reuse and plagiarism in academia and provide broader insights into the nature of text derivation and paraphrase; but the selected scenario provides an ideal initial case study, and one with considerable potential for practical application.