A Hybrid Algorithm for Identifying and Categorizing Plagiarised Text Documents

Victor U. Thompson and Christo Panchev

Abstract: Advancement in internet technology has made information resources more readily available and plagiarism much easier to carry out. Detecting plagiarism is by no means a trivial task because of the sophisticated tactics by which plagiarists disguise their sources. In this paper we present a hybrid algorithm for identifying and categorizing plagiarised text documents. We built our algorithm by combining the potentials of three standard textual similarity measures used in information retrieval (IR). We used the back-propagation neural network (BPNN) for combining the measures and the PAN@Clef 2012 text alignment corpus for experimental purposes. We experimented with four categories of plagiarism, with each category representing a degree of textual similarity, and measured performance in terms of precision, recall and f-measure. Comparative analysis using the same corpus revealed that our hybrid algorithm (HA) outperformed each of the base similarity measures (BSM) in detecting three out of the four categories of plagiarism, and stood at a virtual tie in the fourth category: [highly similar: HA 96.6183%, BSM 96.5517%; lightly reviewed: HA 84.1321%, BSM 80.9636%; heavily reviewed: HA 68.1188%, BSM 67.1255%; highly dissimilar: HA 70.6280%, BSM 69.7%].

Keywords: Information Retrieval, Plagiarism Detection, Similarity Measures, Artificial Neural Networks.

I. INTRODUCTION

The rapid advancement in internet technology has brought about different forms of abuse of information resources, such as document duplication, mirroring of websites and plagiarism [1]. Plagiarism is a problem that is often talked about in the academic and commercial sectors, and detecting plagiarism has received considerable attention from researchers in IR and natural language processing (NLP).
Plagiarism is the act of copying or duplicating someone's information without referencing or acknowledging the information source or author. Two solutions to the problem of plagiarism are frequently mentioned: prevention and detection [2]. Preventing plagiarism means restricting access to websites and materials that could easily be used for plagiarism, enforcing strict laws that would make plagiarism a crime rather than just an ethical matter, and educating students on proper referencing/citation in order to avoid plagiarism [3]. Detecting plagiarism, on the other hand, involves building automated systems that are capable of detecting plagiarised documents with a reasonable level of accuracy. This paper is concerned with the latter: detecting plagiarism using automated systems. Several techniques have been proposed in the literature for detecting plagiarised documents [4], [5], [6], [7]. Most of these techniques are based on measuring instances of overlap (overlapping features) between text documents using some form of similarity measurement [8]. These approaches can be classified into three broad categories, namely fingerprinting [4], [2], [9], [10], vector space model (VSM) and ranking [5], [6], [12], and n-gram overlap [7], [11]. See section 2 for details about these approaches.

V. U. Thompson is currently a PhD student in the Department of Computing, Engineering and Technology, University of Sunderland, Edinburgh Building, Chester Road, Sunderland SR1 3SD (Victor.Thompson@research.sunderland.ac.uk). C. Panchev is a senior lecturer in the Department of Computing, Engineering and Technology, University of Sunderland, Edinburgh Building, Chester Road, Sunderland SR1 3SD (christo.panchev@sunderland.ac.uk).
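As an illustration of the n-gram overlap family of approaches mentioned above, the following sketch computes a simple word-trigram containment score between two documents. This is a generic illustration, not the implementation of any of the cited systems; the function names and the choice of n = 3 are our own:

```python
# Illustrative n-gram overlap (containment) between two documents.

def ngrams(text, n=3):
    """Return the set of word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def ngram_overlap(doc_a, doc_b, n=3):
    """Fraction of doc_a's n-grams that also appear in doc_b."""
    grams_a, grams_b = ngrams(doc_a, n), ngrams(doc_b, n)
    if not grams_a:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a)

src = "the quick brown fox jumps over the lazy dog"
copy = "the quick brown fox leaps over the lazy dog"
print(round(ngram_overlap(src, copy), 2))  # → 0.57
```

A high containment score suggests that much of one document reappears verbatim (or nearly so) in the other, which is the intuition behind n-gram-based plagiarism detectors.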
In most of the approaches used for detecting plagiarism, similarity measures are applied at some point to measure the degree of textual similarity between document pairs, and as argued by Hoad and Zobel [5], some similarity measures are not well suited to certain similarity measurement problems. It is worth noting that similarity measures behave differently [13] and have different strengths. It is therefore likely that a combination of two or more similarity measures will result in a better algorithm (measure) than any of the single measures used in the combination. The question then becomes: how do we combine similarity measures into a hybrid algorithm that performs at least as well as the individual measures? In this study, we combined the potentials of three standard similarity measures (cosine similarity, Jaccard index, Pearson correlation coefficient) into a hybrid algorithm that can automatically search for, identify and categorise plagiarised documents based on degree of textual similarity. We worked on four categories of plagiarism taken from the PAN@Clef 2012 text alignment corpus (highly similar, lightly reviewed, heavily reviewed and highly dissimilar). We compared documents in vector space and used a ranking method to retrieve similar documents. We used an artificial neural network (ANN) to combine the similarity measures and to categorize document pairs based on degree of textual similarity. We measured performance in terms of precision, recall and f-measure; we also measured the error rate of the BPNN by computing its confusion matrix (a measure of how often the BPNN misclassifies). We conclude by comparing our hybrid algorithm with the base similarity measures.

II. PREVIOUS RESEARCH

This section discusses approaches that have been successfully used in the literature for identifying plagiarised documents.
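As an illustrative sketch (not the authors' exact implementation), the three base measures named above can be computed over simple term-frequency vectors as follows. The function names and example documents are our own; in the paper's setup, scores of this kind form the inputs that the BPNN learns to combine:

```python
import math
from collections import Counter

def tf_vectors(doc_a, doc_b):
    """Align term-frequency counts of two documents on a shared vocabulary."""
    ca, cb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    vocab = sorted(set(ca) | set(cb))
    return [ca[t] for t in vocab], [cb[t] for t in vocab]

def cosine(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def jaccard(doc_a, doc_b):
    """Jaccard index over the two documents' word sets."""
    sa, sb = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa or sb else 0.0

def pearson(a, b):
    """Pearson correlation coefficient between two aligned vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb) if sa and sb else 0.0

d1 = "plagiarism detection with standard similarity measures"
d2 = "detection of plagiarism using similarity measures"
v1, v2 = tf_vectors(d1, d2)
features = [cosine(v1, v2), jaccard(d1, d2), pearson(v1, v2)]
```

A feature vector like `features` would then be fed to the back-propagation network, which learns to map the combined scores to one of the four plagiarism categories.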
Popular approaches include the vector space model (VSM) [14] and ranking [6], [5], fingerprinting [4], [2], [9], [10], and n-gram overlap [7], [11]. The VSM approach

Proceedings of the World Congress on Engineering 2015 Vol I, WCE 2015, July 1-3, 2015, London, U.K.
ISBN: 978-988-19253-4-3; ISSN: 2078-0958 (Print); ISSN: 2078-0966 (Online) (revised on 5 May 2015)