Abstract—Source code plagiarism is a severe problem in academia. Programming assignments are used to evaluate students in programming courses, so checking these assignments for plagiarism is essential. When a course has a large number of students, it is impractical for a human inspector to check each assignment, and automated tools are needed to detect plagiarism. The majority of current source code plagiarism detection tools are based on structure-based methods. However, the structural properties of a plagiarized program can differ significantly from those of the original program, so tools based on structural methods struggle to detect plagiarized programs when the plagiarism level is four or above. This paper proposes a new plagiarism detection method based on the attribute counting technique. The novelty of our method is that we use a meta-learning algorithm to improve the accuracy of our plagiarism detection system.

Index Terms—Plagiarism detection, machine learning, source code, naïve Bayes classifier, k-nearest neighbor

I. INTRODUCTION

Detection of source code plagiarism is valuable for both academia and industry. Zobel [1] has pointed out that “students may plagiarize by copying code from friends, the Web or so called ‘private tutors’”. Most programming courses in universities evaluate students based on the marks of programming assignments. Therefore, it is essential to detect and prevent plagiarism at universities. Moreover, Liu et al. [2] have mentioned that “A quality plagiarism detector has a strong impact to law suit prosecution”. Therefore, there is a huge demand for accurate source code plagiarism detection systems from both academia and industry.

Woo and Cho [3] have mentioned two methods for plagiarism detection:
1) Structure-Based Method: this method considers the structural characteristics of documents when developing plagiarism detection algorithms.
2) Attribute Counting Method: this method extracts various measurable features (or metrics) from documents. The extracted metrics are used as input for similarity detection algorithms.

Presently, most source code plagiarism detection algorithms are based on the structure-based method [3], [4], [2]. In addition, there are a few attempts based on the attribute counting method [5], [6].

Manuscript received June 15, 2012; revised August 1, 2012.
U. Bandara is with the Virtusa Corporation, Sri Lanka (e-mail: upulbandara@gmail.com).
G. Wijayarathna is with the Faculty of Science, University of Kelaniya, Sri Lanka (e-mail: gamini@kln.ac.lk).

Faidhi and Robinson [7] have defined a spectrum of program plagiarism levels ranging from level 0 to level 6. Level 0 is the lowest level of plagiarism, which represents copying someone else’s program without modifying it. Level 6 is the highest level of plagiarism, which represents modifying the program’s control logic in order to achieve the same operation. It is to be noted that, when moving from level 0 to level 6, the structural characteristics of the plagiarized program diverge increasingly from those of the original program. Moreover, Arwin and Tahaghoghi [4] have mentioned that plagiarism detection systems which use structure-based techniques rely on the belief that the similarity of two programs can be estimated from the similarity of their structures. Since the structural properties of a plagiarized program vary from those of its original, it is difficult for such systems to detect plagiarism when the level is four or higher. On the other hand, plagiarism detection systems based on attribute counting techniques do not rely on the structural properties of the source program. Therefore, they are not affected by the problem mentioned above.
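To make the attribute counting idea concrete, the following is a minimal sketch, not the feature set used in this paper. It extracts a few measurable metrics from a source file and compares two metric vectors. The four metrics, the function names `extract_metrics` and `cosine_similarity`, and the use of cosine similarity as the comparison step are all assumptions chosen for illustration.

```python
import math
import re


def extract_metrics(source: str) -> list[float]:
    """Return a small, illustrative metric vector for one source file.

    These four metrics are assumptions for this sketch, not the
    attributes used by any particular plagiarism detection system.
    """
    non_blank = [ln for ln in source.splitlines() if ln.strip()]
    # Identifiers/keywords, integer literals, or any single symbol.
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", source)
    words = [t for t in tokens if re.match(r"[A-Za-z_]", t)]
    n_lines = max(len(non_blank), 1)
    n_tokens = max(len(tokens), 1)
    indented = sum(ln.startswith((" ", "\t")) for ln in non_blank)
    return [
        len(tokens) / n_lines,        # tokens per non-blank line
        len(words) / n_tokens,        # share of word-like tokens
        indented / n_lines,           # ratio of indented lines
        source.count("{") / n_lines,  # opening-brace density
    ]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two metric vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical sample: the second program only renames identifiers.
original = (
    "int sum(int n) {\n"
    "    int total = 0;\n"
    "    for (int i = 0; i < n; i++) {\n"
    "        total += i;\n"
    "    }\n"
    "    return total;\n"
    "}\n"
)
renamed = (
    "int add(int k) {\n"
    "    int acc = 0;\n"
    "    for (int j = 0; j < k; j++) {\n"
    "        acc += j;\n"
    "    }\n"
    "    return acc;\n"
    "}\n"
)

similarity = cosine_similarity(extract_metrics(original), extract_metrics(renamed))
print(f"similarity = {similarity:.3f}")
```

Because these metrics count tokens and layout rather than names, renaming identifiers leaves the vector unchanged in this example. In a practical system the vector would be much richer, and it would more plausibly be fed to a learning algorithm than compared with a single similarity measure.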
Presently, systems based on the attribute counting technique are not accurate enough for practical applications [5], [6]. Therefore, we propose a new system that is based on the attribute counting technique and uses a machine learning approach to detect similarities between source code files. Ethem Alpaydin [8] has pointed out that “There is no single learning algorithm that in any domain always induces the most accurate learner”. Further, he has mentioned that prediction performance can be improved by combining multiple base learners in a suitable way. Therefore, instead of using just one learning algorithm, we used three learning algorithms for training our system.

We tested our system with source code belonging to ten developers. During training, we found that no single algorithm was capable of identifying the source code files belonging to all the developers with adequate accuracy. One interesting observation, however, was that the results generated by the three algorithms complemented each other. Therefore, we decided to use a meta-learning algorithm to combine the results generated by the three learning algorithms. More details about the learning algorithms and the meta-learning algorithm are given in the Research Design section.

The rest of the paper is organized as follows. In Section II, we present plagiarism detection methods based on attribute counting techniques. In Section III, we discuss machine learning algorithms for plagiarism detection. In Section IV, we discuss training and testing our system. Finally, we will conclude our paper by discussing the

Detection of Source Code Plagiarism Using Machine Learning Approach
Upul Bandara and Gamini Wijayarathna
International Journal of Computer Theory and Engineering, Vol. 4, No. 5, October 2012