Abstract— Source code plagiarism is a severe problem in academia, where programming assignments are used to evaluate students in programming courses. Checking these assignments for plagiarism is therefore essential. When a course has a large number of students, it is impractical for a human inspector to check every assignment, so automated tools are needed to assist in detecting plagiarism. The majority of current source code plagiarism detection tools are based on structure-based methods. However, the structural properties of a plagiarized program differ significantly from those of the original program at higher plagiarism levels, so plagiarism at level 4 or above is hard to detect with tools based on structural methods. This paper presents a new plagiarism detection method based on machine learning techniques. We have trained and tested three machine learning algorithms for detecting source code plagiarism. Furthermore, we have used a meta-learning algorithm to improve the accuracy of our system.

Index Terms— k-nearest neighbor, machine learning, naïve Bayes classifier, plagiarism detection, source code

Manuscript received September 06, 2011; revised September 8, 2011.
U. Bandara is with the Virtusa Corporation, Sri Lanka (e-mail: upulbandara@gmail.com).
G. Wijayarathna is with the Faculty of Science, University of Kelaniya, Sri Lanka (e-mail: gamini@kln.ac.lk).

I. INTRODUCTION

Detection of source code plagiarism is equally valuable for academia and industry. Zobel [1] has pointed out that “students may plagiarize by copying code from friends, the Web or so called ‘private tutors’”. Most programming courses in universities evaluate students based on the marks of programming assignments. If a programming course has a large number of students, it is impractical for human inspectors to check every assignment for plagiarism. Moreover, Liu et al. [2] have mentioned that “A quality plagiarism detector has a strong impact to law suit prosecution”. Therefore, there is a strong demand from both academia and industry for accurate source code plagiarism detection systems.

Woo and Cho [3] have mentioned two methods for plagiarism detection.

1. Structure-based method: this method considers the structural characteristics of documents when developing plagiarism detection algorithms.
2. Attribute counting method: this method extracts various measurable features (or metrics) from documents. The extracted metrics are used as input to similarity detection algorithms.

Presently, most source code plagiarism detection algorithms are based on the structure-based method [2, 3, 4]. In addition, there are a few attempts based on the attribute counting approach [5, 6].

Faidhi and Robinson [7] have mentioned a spectrum of six levels in program plagiarism. Level 0 is the lowest level of plagiarism and represents copying someone else’s program without modifying it. Level 6 is the highest level of plagiarism and represents modifying the program’s control logic in order to achieve the same operation. Note that, moving from level 0 to level 6, the structural characteristics of a plagiarized program diverge increasingly from those of the original program. Moreover, Arwin and Tahaghoghi [4] have mentioned that structure-based systems rely on the belief that the similarity of two programs can be estimated from the similarity of their structures. Since the structural properties of a plagiarized program vary from those of its original, it is difficult for such systems to detect plagiarism when the plagiarism level is 4 or higher.

On the other hand, plagiarism detection systems based on attribute counting techniques do not rely on the structural properties of the source program, so they do not suffer from the problem mentioned above. At present, however, systems based on attribute counting techniques are not accurate enough for practical applications [5, 6].
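As a rough illustration of the attribute counting idea, the following Python sketch computes a few layout metrics from the text of one source file. The function name `extract_metrics` and the four metrics are assumptions chosen for illustration only; they are not the metric sets used in [5, 6] or in our system, and the comment test assumes a C-like language with `//` comments.

```python
def extract_metrics(source: str) -> dict:
    """Return a few illustrative layout metrics for one source file.

    The metrics are hypothetical examples of "measurable features";
    a real system would use a larger, carefully chosen set.
    """
    lines = source.splitlines()
    total = len(lines) or 1  # avoid division by zero for empty input

    # Count lines of each kind (comment test assumes '//' comments).
    blank = sum(1 for ln in lines if not ln.strip())
    comment = sum(1 for ln in lines if ln.strip().startswith("//"))
    tab_indented = sum(1 for ln in lines if ln.startswith("\t"))
    mean_length = sum(len(ln) for ln in lines) / total

    return {
        "blank_line_ratio": blank / total,
        "comment_line_ratio": comment / total,
        "tab_indent_ratio": tab_indented / total,
        "mean_line_length": mean_length,
    }
```

A vector of this kind, one per file, is what a similarity measure or a learning algorithm would take as input. None of these values depends on the order of statements or the shape of the control flow.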
Therefore, we propose a new system based on the attribute counting technique. Moreover, we use a machine learning approach to detect similarity between source codes. Alpaydin [8] has pointed out that “there is no single learning algorithm that in any domain always induces most accurate learner”. He has also mentioned that prediction performance can be improved by combining multiple base learners in a suitable way. Therefore, instead of using just one learning algorithm, we have used three learning algorithms for training our system.

We tested our system with source code belonging to ten developers. During the training period, we found that no single algorithm was capable of identifying the source code files belonging to all the developers with adequate accuracy. One interesting observation, however, was that the results generated by the three algorithms complemented each other. We therefore decided to use a meta-learning algorithm to combine the results generated by the three learning algorithms.

The rest of the paper is organized as follows. Section II presents plagiarism detection methods based on attribute counting techniques. Section III discusses the machine learning algorithms for plagiarism detection. Section IV discusses the implementation, training, and testing of our system. Section V concludes the paper by discussing the final results and future work.

A Machine Learning Based Tool for Source Code Plagiarism Detection
Upul Bandara and Gamini Wijayarathna
International Journal of Machine Learning and Computing, Vol. 1, No. 4, October 2011, p. 337
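The idea of combining complementary base learners through a meta-learner, as described in the Introduction, can be sketched as follows. This is a minimal illustration under stated assumptions and not necessarily the meta-learning algorithm used in our system: the function names `train_meta` and `predict_meta` are hypothetical, the base learners are represented only by their predicted developer labels, and the meta-learner is a simple lookup table learned from held-out predictions.

```python
from collections import Counter, defaultdict


def train_meta(base_predictions, true_labels):
    """Learn, for each combination of base-learner outputs,
    the true label that most often accompanied it."""
    table = defaultdict(Counter)
    for preds, label in zip(base_predictions, true_labels):
        table[tuple(preds)][label] += 1
    return {combo: counts.most_common(1)[0][0]
            for combo, counts in table.items()}


def predict_meta(meta, preds):
    """Use the learned table; fall back to a majority vote
    when this combination of outputs was never seen."""
    combo = tuple(preds)
    if combo in meta:
        return meta[combo]
    return Counter(combo).most_common(1)[0][0]


# Hypothetical held-out predictions from three base learners
# (one tuple per file) and the true developer of each file.
held_out = [("A", "A", "B"), ("A", "B", "B"), ("A", "B", "B"), ("C", "C", "C")]
truth = ["A", "A", "A", "C"]

meta = train_meta(held_out, truth)
# A plain majority vote would say "B" here; the meta-learner has
# learned that the first learner is the reliable one for this pattern.
print(predict_meta(meta, ("A", "B", "B")))
```

The design point is that the meta-learner is trained on the outputs of the base learners rather than on the raw metrics, so it can learn which base learner to trust for which developer.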