Current Trends in Technology and Science, ISSN: 2279-0535. 8th SASTech 2014 Symposium on Advances in Science & Technology, Commission-IV, Mashhad, Iran. Copyright © 2014 CTTS.IN, All rights reserved.

Introducing an Automated Technique for Bilingual Plagiarism Detection of English-Persian Documents

Soraya Enayati Shiraz
Master of Computer Engineering, Islamic Azad University, Science and Research Branch of Tehran (Semnan), Semnan, Iran, enayatishiraz@yahoo.com

Farzin Yaghmaee
Department of Electrical and Computer Engineering, University of Semnan, Semnan, Iran, f_yaghmaee@semnan.ac.ir

Abstract: Because the Internet provides easy access to vast quantities of electronic data, textual plagiarism has become a major concern, especially in academic documents and in research and scientific institutions. As the amount of information on the Internet grows, the problems caused by bilingual plagiarism have motivated methods that detect it automatically. Existing detection methods mostly target bilingual plagiarism between English and Spanish, Bengali, German, French, Vietnamese, etc. In this paper, a method is proposed that relies on the overall dependence of textual content and uses the vector space model (VSM) to automatically detect English-Persian bilingual plagiarism. After implementation, the method was evaluated on test texts using the precision (accuracy), recall, and F-measure (reliability) criteria; the results show that it can detect English-Persian bilingual plagiarism with an accuracy of 0.88 and a reliability of 0.91.

Keywords: Plagiarism, bilingual plagiarism detection, similarity analysis, morphological analysis, vector space model (VSM).

1. INTRODUCTION

One of the fastest and easiest ways to access up-to-date academic and research resources around the world is the Internet, which, along with its advantages, has disadvantages as well.
One of these disadvantages is that it makes it easier for opportunists to steal researchers' scientific writing, a practice that is a growing challenge in the virtual world. Plagiarism means reusing the ideas, results, and/or words of another person, who presented them first, without explicitly citing the source and its authors [1,2]. Textual plagiarism is one of the most common types of plagiarism. It occurs mostly in universities and official organizations, and today, even with the increasing amount of information, it can be detected using automated and sophisticated methods [2].

In general, text plagiarism detection falls into two categories by language: the first is monolingual (homogeneous) detection, such as English compared with English; the second is cross-lingual (heterogeneous) detection, such as English compared with Persian [3,5]. Monolingual detection techniques are mostly based on the lexical, grammatical, semantic, and structural characteristics of the text and include methods based on text comparison, string matching, writing style, and fingerprinting, with which plagiarism is relatively easy to identify [2,3,4]. Bilingual plagiarism detection, by contrast, has two levels: in the first, the two languages share the same grammar; in the second, they do not. When the grammars differ, detection becomes far more complex, so text classification should be carried out in such a way that the words do not depend on the sentence level. Bilingual text plagiarism detection is therefore based on natural language processing and machine learning techniques. In some of these methods, the paragraphs of the texts are first classified, and plagiarism is then recognized by analyzing the paragraphs of the two texts and their similarities.
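The paragraph-level similarity analysis mentioned above can be illustrated with a small sketch. It is not the method proposed in this paper. It assumes that both texts have already been rendered in one language (for instance by word-by-word translation), that raw term counts serve as vector weights, and that cosine similarity is the comparison measure. The stop-word list, the function names, and the threshold value are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and", "to"}


def term_vector(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words, count terms."""
    tokens = [t.strip(".,;:!?\"'()") for t in text.lower().split()]
    return Counter(t for t in tokens if t and t not in STOP_WORDS)


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def most_similar_paragraphs(suspicious, source, threshold=0.5):
    """Return (i, j, score) for every paragraph pair scoring at or above threshold."""
    source_vectors = [term_vector(p) for p in source]
    hits = []
    for i, paragraph in enumerate(suspicious):
        vec = term_vector(paragraph)
        for j, src_vec in enumerate(source_vectors):
            score = cosine_similarity(vec, src_vec)
            if score >= threshold:
                hits.append((i, j, score))
    return hits
```

Because the vectors ignore word order, a comparison of this kind does not depend on sentence-level grammar, which is the property the discussion above asks for when the two languages have different grammars.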
In recent years, several approaches have been used: information retrieval techniques (based on word/character n-grams (CNG), meaning, dictionaries, and bilingual websites with ontologies) [4,6,11,12], statistical methods [7], support vector machines [8], and statistical data analysis with analysis of the overall dependency of textual content [9,10]. Among these, the CNG-based method, that is, the method based on string matching, has high accuracy, but it is not suitable when the text volume is large or when the two languages do not share the same grammar. English-Persian detection faces several inherent problems: the differences between English and Persian in grammar, sentence structure, and the role of words in a sentence; the difficulties of automatic translation from Persian to English and vice versa; and Persian writing problems caused by the lack of a writing standard, namely the various written forms in use for the same word. Because of these problems, texts are translated word by word, and similarity is analyzed based on the overall dependence of the text content. In this paper, a method is proposed based on the overall dependence of the text content [9,10], using stop-word removal, finding word roots with