Documents Similarities Algorithms for Research Papers Authenticity

Izzat Alsmadi, CIS Department, IT Faculty, Yarmouk University, Irbid, Jordan, ialsmadi@yu.edu.jo
Zakaria Issa Saleh, MIS Department, IT Faculty, Yarmouk University, Irbid, Jordan, zzaatreh@yu.edu.jo

Abstract—Studying document similarity has several fields of application. In this paper, we focus on evaluating document similarity to predict possible plagiarism in research papers. We evaluated several document similarity algorithms, such as Cosine, Dice, Manhattan, and Euclidean. We also tried several approaches for selecting the length of characters or words as a baseline for the search algorithms. Preprocessing steps were necessary to remove several types and categories of stop words that may bias the similarity measurement algorithms. Some of the algorithms are developed to search through local files and others to search through the Internet. Results showed that there is a great deal of trade-off between two conflicting criteria: accuracy and performance.

Keywords—text mining, plagiarism, document similarity, string search

I. INTRODUCTION

The amount of possible plagiarism in research publications varies in its seriousness. Plagiarism can be in wording, through copying statements or paragraphs from other research papers. It can also be in copying ideas (i.e., semantic plagiarism), where methods for comparing the similarity of text or statements may not work well in discovering such plagiarism. In this area, there are conflicting opinions on the thresholds above which a paper can be classified as plagiarized. Double publication is another related problem, where the same author publishes the same or a similar idea through more than one research publication channel.
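As an illustration of the similarity measures named in the abstract (Cosine, Dice, Manhattan, and Euclidean), the sketch below computes them over simple term-frequency vectors after removing a small stop-word list. This is a minimal, hedged rendering of the general techniques, not the authors' implementation; the tiny stop-word set and whitespace tokenizer are assumptions for the example only.

```python
import math
from collections import Counter

def term_freqs(text, stop_words=frozenset({"the", "a", "an", "of", "and", "to"})):
    """Tokenize into lowercase words and drop a small illustrative stop-word list."""
    words = [w for w in text.lower().split() if w.isalpha() and w not in stop_words]
    return Counter(words)

def cosine(a, b):
    """Cosine similarity between term-frequency vectors; 1.0 means identical direction."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def dice(a, b):
    """Dice coefficient over the word sets of the two documents."""
    sa, sb = set(a), set(b)
    return 2 * len(sa & sb) / (len(sa) + len(sb)) if (sa or sb) else 0.0

def manhattan(a, b):
    """Manhattan (L1) distance between term-frequency vectors; 0 means identical."""
    return sum(abs(a[w] - b[w]) for w in set(a) | set(b))

def euclidean(a, b):
    """Euclidean (L2) distance between term-frequency vectors; 0 means identical."""
    return math.sqrt(sum((a[w] - b[w]) ** 2 for w in set(a) | set(b)))
```

Note that Cosine and Dice grow toward 1 with similarity, while Manhattan and Euclidean are distances that shrink toward 0, so any plagiarism threshold must be interpreted accordingly.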
Languages are also barriers that pose challenges to detecting possible plagiarism: some authors may publish the same paper twice in two different languages, especially in journals that are exclusively published in one language.

Document similarity has several areas of applicability. Besides inspecting possible plagiarism, which is the focus of this paper, identifying similar documents can be used to improve search facilities by keeping fewer documents to search within and giving users less to browse through. File synchronization is another important application for users who keep files on several machines (e.g., work, home, etc.). Document similarity can be either language based or syntactic. However, similarity can be more complex and include semantic similarity, even when words and statements are not the same or similar.

II. TECHNIQUES TO DETECT DOCUMENT SIMILARITY

There are many methods to judge similarity between documents. A brute-force approach compares the subject document with the investigated documents word by word. However, in most cases, such an approach is time- and resource-consuming. In addition, it can be easily fooled by editing a small number of words in the document. A more effective approach is based on metrics of the documents, such as the number of statements, paragraphs, punctuation marks, etc. [2, 3]. A similarity index is then calculated to measure the amount of similarity between documents based on those metrics. Comparing documents word by word rather than statement by statement or paragraph by paragraph involves several contradicting trade-offs. On one side, word-by-word comparison can minimize the effect of changing one or a small number of words relative to the total document.
However, word-by-word comparison can be time consuming, and word-to-word document similarity does not necessarily mean possible plagiarism, especially if the algorithm does not take the positions of the words into consideration.

Document similarity measures can be classified into different categories. In one classification, they are divided into word based, keyword based, sentence based, etc. A sentence-by-sentence or paragraph-by-paragraph approach is also affected by several factors, such as the difference in size between the compared documents and the number of words edited in those statements or paragraphs.

© ICCIT 2012
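One possible realization of the metric-based similarity index described above (the paper does not give its exact formula, so this is an illustrative assumption) is to extract crude structural metrics, such as word, sentence, and paragraph counts, from each document and average their min/max ratios into a single score:

```python
import re

def doc_metrics(text):
    """Crude structural metrics: word, sentence, and paragraph counts (illustrative only)."""
    words = len(text.split())
    sentences = len([s for s in re.split(r"[.!?]+", text) if s.strip()])
    paragraphs = len([p for p in text.split("\n\n") if p.strip()])
    return [words, sentences, paragraphs]

def similarity_index(text_a, text_b):
    """Average min/max ratio over the metrics; 1.0 means structurally identical."""
    ratios = [min(x, y) / max(x, y) if max(x, y) else 1.0
              for x, y in zip(doc_metrics(text_a), doc_metrics(text_b))]
    return sum(ratios) / len(ratios)
```

Such an index is far cheaper than word-by-word comparison but coarser: two unrelated documents of similar length and structure will score high, so in practice it serves as a fast filter before more precise checks.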