IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 50, NO. 7, JULY 2004 1545
Shared Information and Program Plagiarism Detection
Xin Chen, Brent Francia, Ming Li, Member, IEEE, Brian McKinnon,
and Amit Seker
Abstract—A fundamental question in information theory and in computer science is how to measure similarity or the amount of shared information between two sequences. We have proposed a metric, based on Kolmogorov complexity, to answer this question and have proven it to be universal. We apply this metric to measure the amount of shared information between two computer programs, to enable plagiarism detection. We have designed and implemented a practical system SID (Software Integrity Diagnosis system) that approximates this metric by a heuristic compression algorithm. Experimental results demonstrate that SID has clear advantages over other plagiarism detection systems. The SID server is online at http://software.bioinformatics.uwaterloo.ca/SID/.
Index Terms—Kolmogorov complexity, program plagiarism detection, shared information.
I. INTRODUCTION
A common thread between information theory and computer science
is the study of the amount of information contained in an ensemble [18],
[19] or a sequence [9]. A fundamental and very practical question has
challenged us for the past 50 years: Given two sequences, how do we
measure their similarity in the sense that the measure captures all of
our intuitive concepts of “computable similarities”? Practical reincar-
nations of this question abound. In genomics, are two genomes similar?
On the Internet, are two documents similar? Among a pile of student
Java programming assignments, are some of them plagiarized?
This correspondence is a part of our continued effort to develop a
general and yet practical theory to answer the challenge. We have pro-
posed a general concept of sequence similarity in [3], [11] and further
developed more suitable theories in [8] and then in [10]. The theory has
been successfully applied to whole genome phylogeny [8], chain letter
evolution [4], language phylogeny [2], [10], and, more recently, classi-
fication of music pieces in MIDI format [6] and extensions in the database
area [16]. In this correspondence, we report our project of the past three
years aimed at applying this general theory to the domain of detecting
programming plagiarisms.
A plagiarized program, following the spirit of Parker and Hamblen
[15], is a program that has been produced from another program with
trivial text edit operations and without detailed understanding of the
program. It is a prevailing problem in university courses with program-
ming assignments. Detecting plagiarism is a tedious and challenging
task for university instructors. A good software tool would help the
instructors to safeguard the quality of education. More generally, the
methodology developed here has other applications such as detecting
Internet plagiarism. Yet, the goal of this work goes beyond a particular
application. Through these efforts, together with other work [3], [11],
Manuscript received March 15, 2003; revised March 18, 2004. This work was
initiated at Bioinformatics Laboratory, Computer Science Department, Univer-
sity of California, Santa Barbara, Santa Barbara, CA 93106, USA, and was sup-
ported in part by the National Science Foundation under ITR Grant 0085801
and REU Grant.
X. Chen is with the Department of Computer Science, University of Cali-
fornia, Riverside, Riverside, CA 92502 USA (e-mail: xinchen@cs.ucr.edu).
B. Francia and M. Li are with the School of Computer Science, University of
Waterloo, Waterloo, ON N2L 3G1, Canada (e-mail: mli@uwaterloo.ca).
B. McKinnon and A. Seker are with the Computer Science Department, Uni-
versity of California, Santa Barbara, Santa Barbara, CA 93106 USA.
Communicated by E.-h. Yang, Guest Editor.
Digital Object Identifier 10.1109/TIT.2004.830793
[8], [4], [10], [6], we hope to develop and justify a general and practical
theory of shared information between two sequences.
Many program plagiarism detection systems have already been de-
veloped [1], [7], [20], [23]. Based on which characteristic properties
they employ to compare two programs, these systems can be roughly
grouped into two categories: attribute-counting systems and structure-
metric systems. A simple attribute-counting system [14] only counts
the number of distinct operators, distinct operands, total number of op-
erators of all types, and total number of operands of all types, and then
constructs a profile using these statistics for each program. A struc-
ture-metric system extracts and compares representations of the pro-
gram structures; therefore, it gives an improved measure of similarity
and is a more effective practical technique to detect program plagiarism
[21]. Widely used systems, such as Plague [20], MOSS [1], JPlag [13],
SIM [7], and the YAP family [23], are all structure-metric systems. Such
systems usually consist of two phases: the first phase involves a tok-
enization procedure to convert source codes into token sequences by a
lexical analyzer; the second phase involves a method to compare those
token sequences. Note that a basic problem underlying the second
phase of a structure-metric system is how to measure the similarity of a
pair of token sequences. An inappropriate metric lets some plagiarisms go
unnoticed, and a well-defined but nonuniversal [9] metric can always
be cheated.
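The attribute-counting approach described above can be illustrated with a minimal sketch. The token classification below is an illustrative assumption for a small C-like fragment, not the exact attribute set of any system cited here; the four counts correspond to the statistics named in the text (distinct operators, distinct operands, total operators, total operands).

```python
import re

# Hypothetical operator set for a small C-like language; how tokens are
# split into operators and operands here is an illustrative assumption.
OPERATORS = {"+", "-", "*", "/", "=", "==", "<", ">", "(", ")", "{", "}", ";"}

def attribute_profile(source):
    """Build the four-number profile used by a simple attribute-counting
    system: (distinct operators, distinct operands,
             total operators, total operands)."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|==|[-+*/=<>(){};]", source)
    operators = [t for t in tokens if t in OPERATORS]
    operands = [t for t in tokens if t not in OPERATORS]
    return (len(set(operators)), len(set(operands)),
            len(operators), len(operands))

profile = attribute_profile("x = a + b; y = a + 1;")
```

Two programs with similar profiles are flagged as candidate plagiarism pairs; the weakness noted in the text is that such profiles ignore program structure entirely, so reordering or restructuring code leaves the profile unchanged.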
Wise [22] presented three properties an algorithm measuring pro-
gram similarity must have: a) each token in either string is counted at
most once; b) transposed code segments should have a minimal effect
on the resulting similarity score; c) this score must degrade gracefully
in the presence of random insertions or deletions of tokens. It is dis-
putable whether these three criteria are sufficient. For example, many
other things should also have minimal effects: duplicated blocks, al-
most duplicated blocks, insertion of irrelevant large blocks, etc. In fact,
there are simply too many cases to enumerate.
We will take a radically different approach. We will take one step
back, away from specific applications, such as program plagiarism de-
tection. We will look at an information-based metric that measures
the amount of information shared between two sequences, any two se-
quences: DNA sequences, English documents, or, for the sake of this
correspondence, programs. Our measure is based on Kolmogorov com-
plexity [9] and it is universal. The universality guarantees that if there is
any similarity between two sequences under any computable similarity
metric, our measure will detect it. Although this measure is not com-
putable, in this correspondence we design and implement an efficient
system SID (Software Integrity Diagnosis system) to approximately
calculate this metric score (thus, SID may also be read as Shared
Information Distance). These are detailed in Sections III and IV-A. In
Section II we first survey related work, and in Section IV we introduce
our plagiarism detection system SID. Experimental results are given in
Section V.
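Although the universal measure itself is uncomputable, the idea of approximating it by compression can be sketched with a standard compressor. The sketch below is not SID's own heuristic compression algorithm (detailed later); it is a generic normalized compression distance in the spirit of this line of work, using zlib purely for illustration.

```python
import zlib

def approx_complexity(data: bytes) -> int:
    """Approximate the (uncomputable) Kolmogorov complexity of `data`
    by its zlib-compressed length."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: a computable stand-in for an
    information-based similarity metric. Values near 0 mean the inputs
    share most of their information; values near 1 mean they share little."""
    cx = approx_complexity(x)
    cy = approx_complexity(y)
    cxy = approx_complexity(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Illustrative inputs: two loops that differ only in identifier names,
# and a structurally unrelated byte sequence.
p = b"for i in range(10): total += values[i]\n" * 20
q = b"for j in range(10): acc += items[j]\n" * 20
r = bytes(range(256)) * 4
```

Intuitively, if `x` and `y` share information, compressing their concatenation costs little more than compressing the larger one alone, so the distance is small; a real compressor only approximates this, which is why SID uses a compression algorithm tailored to token sequences.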
II. RELATED WORK IN PLAGIARISM DETECTION
This section surveys several plagiarism detection systems. Many
such systems exist; here we review only four representative systems.
A token in this paper refers to a basic unit in a programming
language, such as keywords ("if," "then," "else," "while"), stan-
dard arithmetic operators, parentheses, or variable types. A precise
definition of tokens is available at http://software.bioinformatics.uwa-
terloo.ca/SID/TestDef.html. A parser parses a program into a sequence
of tokens. We can naturally assume that all correct parsers return
identical token sequences for the same program.
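The tokenization phase can be sketched as follows. The token classes and the tiny keyword set below are illustrative assumptions, not SID's actual token definition (which is given at the project page referenced above); identifiers and literals are abstracted so that renaming variables does not change the token sequence, a standard property of such lexical front ends.

```python
import re

# Illustrative keyword set for a Java-like fragment (an assumption,
# not the actual SID token definition).
KEYWORDS = {"if", "then", "else", "while", "int", "return"}

TOKEN_RE = re.compile(r"""
    (?P<word>[A-Za-z_]\w*) |
    (?P<number>\d+)        |
    (?P<op>==|<=|>=|[-+*/=<>(){};])
""", re.VERBOSE)

def tokenize(source):
    """Convert source code into a token sequence. Identifiers become the
    abstract token 'ID' and numeric literals become 'NUM', so programs
    differing only in names map to identical sequences."""
    tokens = []
    for m in TOKEN_RE.finditer(source):
        if m.lastgroup == "word":
            text = m.group()
            tokens.append(text if text in KEYWORDS else "ID")
        elif m.lastgroup == "number":
            tokens.append("NUM")
        else:
            tokens.append(m.group())
    return tokens

# Two programs differing only in identifier names yield the same sequence.
t1 = tokenize("while (i < n) { s = s + i; }")
t2 = tokenize("while (a < b) { t = t + a; }")
```

Under the assumption stated in the text that all correct parsers return identical token sequences for the same program, the second phase can compare such sequences directly.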
0018-9448/04$20.00 © 2004 IEEE