Expert Systems With Applications 197 (2022) 116677
Available online 17 February 2022
0957-4174/© 2022 Elsevier Ltd. All rights reserved.
An external plagiarism detection system based on part-of-speech (POS) tag
n-grams and word embedding
Kadir Yalcin *, Ilyas Cicekli, Gonenc Ercan
Department of Computer Engineering, Hacettepe University, 06800 Beytepe, Ankara, Turkey
ARTICLE INFO
Keywords:
Plagiarism detection
Part-of-speech (POS) tagging
N-grams
Semantic similarity
Word embedding
ABSTRACT
This paper presents an automatic plagiarism detection system that identifies plagiarized passages in
documents. The system uses both syntactic and semantic similarities to identify plagiarized passages.
The novelty of the proposed method lies in its use of part-of-speech tag n-grams (POSNG), which capture
syntactic similarities between source and suspicious sentences. Each source document is indexed by its
part-of-speech (POS) tag n-grams in a search engine so that sentences that are possible plagiarism
candidates can be retrieved rapidly. Although the system obtains very good results using POS tag n-grams
alone, its performance is further improved by adding semantic similarities. The semantic relatedness
between words is measured with the word embedding technique Word2Vec, and the longest common subsequence
approach is used to measure the semantic similarity between source and suspicious sentences. There are
several types of plagiarism, such as verbatim, paraphrasing, source-code, and cross-lingual. High
obfuscation paraphrasing is one of the most difficult types of plagiarism to detect. The main
contribution of this paper is that the proposed method, which is based on POS tag n-grams, improves
detection performance on the high obfuscation paraphrasing type.
For this study, we use the large PAN-PC-11 dataset, which was created for the evaluation of automatic
plagiarism detection algorithms. Our experiments are conducted on the four paraphrasing types in
PAN-PC-11: none, low, high, and simulated obfuscation. We defined various threshold and parameter
settings in order to assess the diversity of our results. We compared the performance of our method
with the plagiarism detectors in the 3rd International Competition on Plagiarism Detection (PAN11).
According to the experimental results, the proposed method achieved the best performance in terms of
the plagdet measure in the high and low obfuscation paraphrasing types and produced competitive results
in the other paraphrasing types.
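The two similarity signals named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the Jaccard overlap used as the syntactic score, the 0.8 relatedness threshold, and the toy vectors standing in for Word2Vec embeddings are all assumptions made for illustration. Sentences are assumed to be already tokenized and POS-tagged.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


def pos_ngrams(tags: Sequence[str], n: int = 3) -> List[Tuple[str, ...]]:
    """Return the contiguous n-grams over a sentence's POS tag sequence."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def pos_ngram_overlap(a: Sequence[str], b: Sequence[str], n: int = 3) -> float:
    """Jaccard overlap of POS tag n-gram sets (an assumed syntactic score)."""
    set_a, set_b = set(pos_ngrams(a, n)), set(pos_ngrams(b, n))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def soft_lcs_length(
    source: Sequence[str],
    suspicious: Sequence[str],
    relatedness: Callable[[str, str], float],
    threshold: float = 0.8,  # assumed cut-off, not taken from the paper
) -> int:
    """Longest common subsequence length where two words count as a match
    when relatedness(w1, w2) >= threshold instead of requiring equality."""
    rows, cols = len(source), len(suspicious)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if relatedness(source[i - 1], suspicious[j - 1]) >= threshold:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[rows][cols]


# Hypothetical 3-dimensional vectors standing in for trained Word2Vec embeddings.
TOY_VECTORS: Dict[str, Tuple[float, ...]] = {
    "car": (1.0, 0.0, 0.0),
    "automobile": (0.9, 0.1, 0.0),
    "fast": (0.0, 1.0, 0.0),
    "quick": (0.1, 0.9, 0.0),
    "is": (0.0, 0.0, 1.0),
}


def toy_relatedness(w1: str, w2: str) -> float:
    """Cosine similarity over TOY_VECTORS; unknown words only match themselves."""
    if w1 == w2:
        return 1.0
    v1, v2 = TOY_VECTORS.get(w1), TOY_VECTORS.get(w2)
    if v1 is None or v2 is None:
        return 0.0
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(y * y for y in v2))
    return dot / norm if norm else 0.0
```

On the toy data, `soft_lcs_length(["car", "is", "fast"], ["automobile", "is", "quick"], toy_relatedness)` aligns all three word pairs, whereas exact string matching aligns only "is". This is the kind of gap that embedding-based relatedness is meant to close for paraphrased text.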
1. Introduction
In recent years, with the growth of data on the web, people can access
information easily. It is possible to prepare papers, assignments or
reports in a very short time using the simple copy and paste method.
Hence, it has become easier to create a new document on any subject by
copying sections from various sources on the web (Sánchez-Vega,
Villatoro-Tello, Montes-y-Gómez, Villaseñor-Pineda, & Rosso, 2013).
This has led to the existence of duplicate or multiple documents that
have the same or similar content in a large database (Varol & Hari, 2015).
The widespread practice of copying without citation has increased the
number of plagiarism cases.
Plagiarism is defined as the act of taking ideas or expressions from
the writings of others and presenting them as one's own¹. Plagiarism can
take many different forms (Dhir, Arora, & Arora, 2008; Martin,
1994; Gupta & Banda, 2012; Bin-Habtoor & Zaher, 2012; Gipp &
Meuschke, 2011), such as the following:
• Copying whole or some part of a document directly without citing
references,
• Changing the linguistic structure of a document describing someone
else’s ideas,
* Corresponding author.
E-mail addresses: kyalcin@cs.hacettepe.edu.tr (K. Yalcin), ilyas@cs.hacettepe.edu.tr (I. Cicekli), gonenc@cs.hacettepe.edu.tr (G. Ercan).
¹ http://www.oxforddictionaries.com/definition/english/plagiarism
https://doi.org/10.1016/j.eswa.2022.116677
Received 30 September 2020; Received in revised form 18 January 2022; Accepted 11 February 2022