Current Trends in Technology and Science
ISSN : 2279-0535
8th SASTech 2014 Symposium on Advances in Science & Technology, Commission-IV, Mashhad, Iran
Copyright © 2014 CTTS.IN, All rights reserved
Introducing an Automated Technique for Bilingual
Plagiarism Detection of English-Persian Documents
Soraya Enayati Shiraz
Master of Computer Engineering, Islamic Azad University, Science and Research Branch
of Tehran (Semnan), Semnan, Iran, enayatishiraz@yahoo.com
Farzin Yaghmaee
Department of Electrical and Computer Engineering, University of Semnan, Semnan, Iran
f_yaghmaee@semnan.ac.ir
Abstract — With the easy access that the Internet
provides to vast quantities of electronic data, textual
plagiarism has become a major concern, especially in
academic documents and in research and scientific
institutions. As the amount of information on the
Internet grows, the problems caused by bilingual
plagiarism have motivated automatic detection
methods. The methods proposed so far mostly target
bilingual plagiarism between English and languages
such as Spanish, Bengali, German, French, and
Vietnamese. In this paper, a method is proposed that
relies on the overall dependence of textual content and
uses the vector space model (VSM) to automatically
detect English-Persian bilingual plagiarism. After
implementation, the method was assessed on test texts
using the accuracy (precision), recall, and F-measure
criteria, and the results showed that the proposed
method can detect English-Persian bilingual
plagiarism with an accuracy of 0.88 and a
reliability of 0.91.
Keywords — Plagiarism, bilingual plagiarism detection,
similarity analysis, morphological analysis, vector
space model (VSM).
1. INTRODUCTION
One of the fast and easy ways to access up-to-date academic
and research resources around the world is the Internet,
which has disadvantages alongside its advantages. One
disadvantage is that it makes it easier for opportunists to
steal the writings of scientific researchers, which is a
growing challenge in the virtual world. Plagiarism means
reusing the ideas, results, and/or words of another
person, who presented them first, without explicitly
citing the source and its authors [1,2].
Textual plagiarism is one of the most common types of
plagiarism. It mostly takes place in universities and
official organizations, and today, despite the increasing
amount of information, it can be detected using
automated and sophisticated methods [2]. Broadly,
textual plagiarism detection falls into two categories.
The first is monolingual (homogeneous) detection, such
as comparing English with English. The second is
cross-lingual (heterogeneous) detection, such as
comparing English with Persian [3, 5].
Monolingual textual plagiarism detection techniques are
mostly based on the lexical, grammatical, semantic, and
structural characteristics of text. They include methods
based on textual comparison, string matching, writing
style, and fingerprinting, with which plagiarism is
relatively easy to identify [2,3,4]. Bilingual plagiarism
detection, by contrast, has two levels: in the first, the
two languages share the same grammar; in the second,
they do not. When the grammars differ, detection becomes
far more complex, so texts should be classified in such a
way that the words do not depend on the sentence level.
Bilingual text plagiarism detection therefore relies on
natural language processing and machine learning
techniques. In some methods, the paragraphs of the texts
are first classified, and plagiarism is then recognized by
analyzing the paragraphs of the two texts and their
similarities. In recent years, information retrieval
techniques (based on word or character n-grams (CNG),
meaning, dictionaries, and bilingual websites with
ontology) [4,6,11,12], statistical methods [7], support
vector machines [8], and statistical data analysis
together with analysis of the overall dependence of
textual content [9,10] have been used. Among these
methods, the CNG-based method, which relies on string
matching, has high accuracy, but it is not suitable when
the text volume is large or when the two languages do not
share the same grammar.
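To make the CNG idea concrete, the following minimal sketch compares two texts by the overlap of their character n-gram sets. It is only an illustration: the function names `char_ngrams` and `ngram_jaccard`, the choice of n = 3, and the use of Jaccard overlap are assumptions of this sketch, not details taken from the cited methods.

```python
from __future__ import annotations


def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of a lowercased text
    whose whitespace runs have been collapsed to single spaces."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def ngram_jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap (0.0 to 1.0) of the character n-gram sets of two texts.

    Assumption of this sketch: Jaccard overlap stands in for whatever
    matching score a real CNG-based detector would use.
    """
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0  # neither text is long enough to yield an n-gram
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

Because this kind of matching operates on surface character sequences, two texts written in different scripts, such as English and Persian, share no n-grams at all unless one of them is translated first, which is consistent with the limitation noted above.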
Several difficulties are inherent to working with the
Persian language: the differences between English and
Persian grammar, sentence structure, and the roles of
words in a sentence; the problems of automatic translation
from Persian to English and vice versa; and Persian
writing problems caused by the lack of a standardized
orthography, namely the various written forms in use for
the same word. Because of these problems, text
translation is carried out at the word level, and
similarity is analyzed based on the overall dependence
of the contents of the text.
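The idea in the preceding paragraph, translating at the word level and then comparing overall content, can be sketched as follows. This is an illustrative sketch and not the authors' implementation: the toy dictionary `TOY_DICTIONARY`, the function names, and the use of raw term-frequency vectors with cosine similarity are all assumptions made here for illustration.

```python
import math
from collections import Counter

# Hypothetical toy Persian-to-English dictionary, for illustration only.
TOY_DICTIONARY = {"دانشجو": "student", "کتاب": "book", "خواند": "read"}


def translate_words(text, dictionary):
    """Translate word by word; words missing from the dictionary are kept unchanged."""
    return [dictionary.get(word, word) for word in text.split()]


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between the term-frequency vectors of two token lists."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(freq_a[term] * freq_b[term] for term in freq_a)
    norm_a = math.sqrt(sum(count * count for count in freq_a.values()))
    norm_b = math.sqrt(sum(count * count for count in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction in the vector space
    return dot / (norm_a * norm_b)


# Persian sentence in subject-object-verb order, translated word by word.
persian_tokens = translate_words("دانشجو کتاب خواند", TOY_DICTIONARY)
english_tokens = "the student read the book".lower().split()
score = cosine_similarity(persian_tokens, english_tokens)
```

Because the term-frequency vectors ignore word order, the different word order of the Persian sentence does not affect the score; only the shared vocabulary after translation does.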
In this paper, a method is proposed based on the
overall dependence of the text content [9, 10], using
stop-word removal, stemming (finding word roots) with