IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 50, NO. 7, JULY 2004

Shared Information and Program Plagiarism Detection

Xin Chen, Brent Francia, Ming Li, Member, IEEE, Brian McKinnon, and Amit Seker

Abstract—A fundamental question in information theory and in computer science is how to measure similarity, or the amount of shared information, between two sequences. We have proposed a metric, based on Kolmogorov complexity, to answer this question and have proven it to be universal. We apply this metric to measure the amount of shared information between two computer programs, to enable plagiarism detection. We have designed and implemented a practical system, SID (Software Integrity Diagnosis system), that approximates this metric by a heuristic compression algorithm. Experimental results demonstrate that SID has clear advantages over other plagiarism detection systems. The SID system server is online at http://software.bioinformatics.uwaterloo.ca/SID/.

Index Terms—Kolmogorov complexity, program plagiarism detection, shared information.

I. INTRODUCTION

A common thread between information theory and computer science is the study of the amount of information contained in an ensemble [18], [19] or a sequence [9]. A fundamental and very practical question has challenged us for the past 50 years: given two sequences, how do we measure their similarity in the sense that the measure captures all of our intuitive concepts of "computable similarities"? Practical reincarnations of this question abound. In genomics, are two genomes similar? On the Internet, are two documents similar? Among a pile of student Java programming assignments, are some of them plagiarized?

This correspondence is part of our continued effort to develop a general and yet practical theory to answer this challenge. We have proposed a general concept of sequence similarity in [3], [11] and further developed more suitable theories in [8] and then in [10].
The theory has been successfully applied to whole genome phylogeny [8], chain letter evolution [4], language phylogeny [2], [10], and, more recently, the classification of music pieces in MIDI format [6], with extensions in the database area [16]. In this correspondence, we report on our project of the past three years aimed at applying this general theory to the domain of detecting programming plagiarism.

A plagiarized program, following the spirit of Parker and Hamblen [15], is a program that has been produced from another program with trivial text edit operations and without detailed understanding of the program. Plagiarism is a prevailing problem in university courses with programming assignments. Detecting it is a tedious and challenging task for university instructors, and a good software tool would help instructors safeguard the quality of education. More generally, the methodology developed here has other applications, such as detecting Internet plagiarism. Yet the goal of this work goes beyond any particular application: through these efforts, together with other work [3], [11], [8], [4], [10], [6], we hope to develop and justify a general and practical theory of shared information between two sequences.

Manuscript received March 15, 2003; revised March 18, 2004. This work was initiated at the Bioinformatics Laboratory, Computer Science Department, University of California, Santa Barbara, Santa Barbara, CA 93106 USA, and was supported in part by the National Science Foundation under ITR Grant 0085801 and an REU Grant.
X. Chen is with the Department of Computer Science, University of California, Riverside, Riverside, CA 92502 USA (e-mail: xinchen@cs.ucr.edu).
B. Francia and M. Li are with the School of Computer Science, University of Waterloo, Waterloo, ON N2L 3G1, Canada (e-mail: mli@uwaterloo.ca).
B. McKinnon and A. Seker are with the Computer Science Department, University of California, Santa Barbara, Santa Barbara, CA 93106 USA.
Communicated by E.-h. Yang, Guest Editor.
Digital Object Identifier 10.1109/TIT.2004.830793

Many program plagiarism detection systems have already been developed [1], [7], [20], [23]. Based on which characteristic properties they use to compare two programs, these systems can be roughly grouped into two categories: attribute-counting systems and structure-metric systems. A simple attribute-counting system [14] counts only the number of distinct operators, the number of distinct operands, the total number of operators of all types, and the total number of operands of all types, and then constructs a profile from these statistics for each program. A structure-metric system extracts and compares representations of the program structure; it therefore gives an improved measure of similarity and is a more effective practical technique for detecting program plagiarism [21]. Widely used systems such as Plague [20], MOSS [1], JPlag [13], SIM [7], and the YAP family [23] are all structure-metric systems. Such systems usually consist of two phases: the first phase uses a tokenization procedure to convert source code into token sequences via a lexical analyzer; the second phase compares those token sequences. Note that the basic problem underlying the second phase of a structure-metric system is how to measure the similarity of a pair of token sequences. An inappropriate metric lets some plagiarisms go unnoticed, and a well-defined but nonuniversal [9] metric can always be cheated.

Wise [22] presented three properties that an algorithm measuring program similarity must have: a) each token in either string is counted at most once; b) transposed code segments should have minimal effect on the resulting similarity score; c) the score must degrade gracefully in the presence of random insertions or deletions of tokens. It is disputable whether these three criteria are sufficient.
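The attribute-counting profile described above can be sketched in a few lines. The following is a minimal illustration, not code from any of the cited systems; the function name, the operator set, and the crude regular-expression lexer are all assumptions made for the example.

```python
import re
from collections import Counter

# Illustrative operator set; a real system would enumerate the full
# operator vocabulary of the target language.
OPERATORS = {"+", "-", "*", "/", "=", "==", "<", ">", "(", ")", "{", "}", ";"}

def count_attributes(source: str) -> dict:
    # Crude lexing: pick out operator symbols and word-like operands.
    tokens = re.findall(r"==|[+\-*/=<>(){};]|\w+", source)
    ops = Counter(t for t in tokens if t in OPERATORS)
    operands = Counter(t for t in tokens if t not in OPERATORS)
    # The four statistics of a simple attribute-counting profile [14].
    return {
        "distinct_operators": len(ops),
        "distinct_operands": len(operands),
        "total_operators": sum(ops.values()),
        "total_operands": sum(operands.values()),
    }

print(count_attributes("x = a + b; y = a + b;"))
```

Two programs are then compared by comparing these four-number profiles, which is exactly why such systems are easy to fool: any edit that preserves the counts goes undetected.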
For example, many other things should also have minimal effects: duplicated blocks, almost duplicated blocks, insertion of irrelevant large blocks, and so on. In fact, there are simply too many cases to enumerate.

We take a radically different approach. We take one step back, away from specific applications such as program plagiarism detection, and look at an information-based metric that measures the amount of information shared between two sequences, any two sequences: DNA sequences, English documents, or, for the sake of this correspondence, programs. Our measure is based on Kolmogorov complexity [9] and it is universal. The universality guarantees that if there is any similarity between two sequences under any computable similarity metric, our measure will detect it. Although this measure is not computable, in this correspondence we design and implement an efficient system, SID (Software Integrity Diagnosis system), to approximately calculate this metric score (thus, SID may also be defined as Shared Information Distance). These are detailed in Sections III and IV-A. In Section II we first survey related work, and in Section IV we introduce our plagiarism detection system SID. Experimental results are given in Section V.

II. RELATED WORK IN PLAGIARISM DETECTION

This section surveys several plagiarism detection systems. Many such systems exist; here, we review only four typical and representative systems.

A token in this paper refers to a basic unit of a programming language, such as a keyword ("if," "then," "else," "while"), a standard arithmetic operator, a parenthesis, or a variable type. A precise definition of tokens is available at http://software.bioinformatics.uwaterloo.ca/SID/TestDef.html. A parser parses a program into a sequence of tokens. We can naturally assume that all correct parsers return identical token sequences for the same program.

0018-9448/04$20.00 © 2004 IEEE
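The two-phase structure-metric pipeline, tokenization followed by an information-based comparison, can be sketched as follows. This is only an illustrative approximation: zlib stands in for SID's heuristic compressor, the toy tokenizer and keyword set are assumptions made for the example, and the normalized compression distance below is a standard compression-based surrogate for the (noncomputable) Kolmogorov metric, not SID's exact score.

```python
import re
import zlib

# Tiny illustrative keyword set; a real lexer covers the whole language.
KEYWORDS = {"if", "else", "while", "for", "return", "int", "float"}

def tokenize(source: str) -> bytes:
    # Map every identifier to a single placeholder token, so that
    # renaming variables does not change the token sequence.
    words = re.findall(r"\w+|[^\s\w]", source)
    tokens = [w if w in KEYWORDS or not w[0].isalpha() else "ID" for w in words]
    return " ".join(tokens).encode()

def ncd(x: bytes, y: bytes) -> float:
    # Normalized compression distance: near 0 for very similar inputs,
    # near 1 when the inputs share little information.
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# p2 is p1 with all variables renamed -- a typical trivial edit.
p1 = "int total = 0; for (int i = 0; i < n; i++) { total = total + i; }" * 5
p2 = "int acc = 0; for (int k = 0; k < m; k++) { acc = acc + k; }" * 5
p3 = "return x * x + y * y;" * 5

print(ncd(tokenize(p1), tokenize(p2)))  # small: heavily shared structure
print(ncd(tokenize(p1), tokenize(p3)))  # larger: little shared structure
```

Because the identifier renaming disappears at the token level, the compressed concatenation of the two token streams is barely larger than either stream alone, and the distance stays small; the unrelated program compresses almost independently and scores much higher.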