Document Similarity Algorithms for Research
Paper Authenticity
Izzat Alsmadi
CIS department, IT faculty
Yarmouk University
Irbid, Jordan
ialsmadi@yu.edu.jo
Zakaria Issa Saleh
MIS department, IT faculty
Yarmouk University
Irbid, Jordan
zzaatreh@yu.edu.jo
Abstract—Studying document similarity has several
fields of application. In this paper, we focus on evaluating
document similarity to predict possible plagiarism in
research papers. We evaluate the use of several
document similarity algorithms, such as Cosine, Dice,
Manhattan, and Euclidean. We also try several approaches
for selecting the length of characters or words as a
baseline for the search algorithms. Preprocessing steps were
necessary to remove several types and categories of stop
words that may bias the similarity measurement algorithms.
Some of the algorithms are developed to search through local
files and others to search through the Internet.
Results show that there is a considerable trade-off between
the two conflicting criteria: accuracy and performance.
Keywords—text mining; plagiarism; document similarity;
string search.
I. INTRODUCTION
Possible plagiarism in research publications varies
in its seriousness. Plagiarism can be
in wording, through copying statements or paragraphs from
other research papers. It can also be in copying ideas (i.e.,
semantic plagiarism), where methods for comparing text or
statement similarity may not work well in discovering
such plagiarism. In this area, there are conflicting opinions
on the thresholds above which a paper can be classified as
plagiarized. Double publication is another related
problem, where the same author publishes
the same or a similar idea in more than one research
publication channel. Language is also a barrier that
challenges plagiarism detection, since some
authors may publish the same paper twice in two
different languages, especially in journals that are
exclusively published in one language.
Document similarity has several areas of
applicability. Besides inspecting possible plagiarism,
which is the focus of this paper, identifying similar
documents can be used to improve search facilities by
keeping fewer documents to search within and
giving users less to browse through. File
synchronization is another important application for users
who keep files on several machines (e.g., work, home, etc.).
Document similarity can be language based
or syntactic. However, similarity can be more complex and
include semantic similarity, where words and
statements may not be the same or even similar.
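As a concrete illustration of syntactic, word-level similarity, measures such as Cosine and Dice (listed in the abstract) can be computed over word-count vectors. The following is a minimal sketch, not the paper's actual implementation; function names and the whitespace-based tokenization are illustrative assumptions:

```python
import math
from collections import Counter

def word_counts(text):
    # Lowercase and split on whitespace; full preprocessing would
    # also strip punctuation and remove stop words, as the paper notes.
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    # Cosine of the angle between the two word-count vectors:
    # 1.0 for identical word distributions, 0.0 for no shared words.
    ca, cb = word_counts(a), word_counts(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def dice_similarity(a, b):
    # Dice coefficient over the two vocabularies:
    # twice the shared word count over the total vocabulary sizes.
    sa, sb = set(word_counts(a)), set(word_counts(b))
    if not sa and not sb:
        return 0.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))
```

Both measures ignore word order, which is why, as discussed below, position-insensitive similarity alone is a weak indicator of plagiarism.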
Techniques to detect document similarity
In this area, there are many methods to judge
similarity between documents. A brute-force approach
compares the subject document with the investigated
documents word by word. However, in most cases, such an
approach is time- and resource-consuming. In addition,
it can be easily fooled by editing a small
number of words in the document. A more effective
approach is based on metrics related to the
documents, such as the number of statements, paragraphs,
punctuation marks, etc. [2, 3]. A similarity index is then
calculated to measure the amount of similarity between
documents based on those metrics. Comparing the
word-by-word approach with a statement-by-statement or
paragraph-by-paragraph approach, for example, reveals several
contradicting trade-offs. On one side, word-by-word
comparison can minimize the effect of changing one or a
small number of words relative to the total document.
However, it can be time consuming, and word-to-word
document similarity does not necessarily mean possible
plagiarism, especially if the algorithm does not take the
position of the words into consideration. Document
similarity can be classified into different categories. In one
classification, methods are word based,
keyword based, sentence based, etc. The sentence-by-sentence
or paragraph-by-paragraph approach is also affected by several
variances, such as the difference in size between the compared
documents and the number of words edited in those
statements or paragraphs.
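The metric-based approach described above can be sketched as follows. The particular metrics (word, sentence, paragraph, and punctuation counts) and the min/max ratio used to combine them are illustrative assumptions; the paper cites [2, 3] for the actual metric choices:

```python
import string

def document_metrics(text):
    # Simple structural metrics of a document. Sentence counting
    # here is naive (terminal punctuation only); paragraphs are
    # assumed to be separated by blank lines.
    return {
        "words": len(text.split()),
        "sentences": text.count(".") + text.count("!") + text.count("?"),
        "paragraphs": len([p for p in text.split("\n\n") if p.strip()]),
        "punctuation": sum(c in string.punctuation for c in text),
    }

def similarity_index(a, b):
    # Average the min/max ratio of each metric: 1.0 means the two
    # documents have identical structural profiles; values near 0
    # mean very different profiles.
    ma, mb = document_metrics(a), document_metrics(b)
    ratios = []
    for key in ma:
        hi, lo = max(ma[key], mb[key]), min(ma[key], mb[key])
        ratios.append(lo / hi if hi else 1.0)
    return sum(ratios) / len(ratios)
```

Unlike word-by-word comparison, this index is cheap to compute and robust to small edits, but it captures only structural resemblance, illustrating the accuracy/performance trade-off noted in the abstract.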
© ICCIT 2012 210