Expert Systems With Applications 197 (2022) 116677
Available online 17 February 2022
0957-4174/© 2022 Elsevier Ltd. All rights reserved.

An external plagiarism detection system based on part-of-speech (POS) tag n-grams and word embedding

Kadir Yalcin *, Ilyas Cicekli, Gonenc Ercan
Department of Computer Engineering, Hacettepe University, 06800 Beytepe, Ankara, Turkey

ARTICLE INFO

Keywords: Plagiarism detection; Part-of-speech (POS) tagging; N-grams; Semantic similarity; Word embedding

ABSTRACT

The aim of this paper is to present an automatic plagiarism detection system that identifies plagiarized passages of documents. The system uses both syntactic and semantic similarities to identify these passages. The novelty of the proposed method lies in its use of part-of-speech tag n-grams (POSNG), which capture syntactic similarities between source and suspicious sentences. A search engine indexes each source document by its part-of-speech (POS) tag n-grams so that sentences that are possible plagiarism candidates can be retrieved rapidly. Although the system obtains very good results using POS tag n-grams alone, its performance improves further when semantic similarities are also used. The semantic relatedness between words is measured with the word embedding technique Word2Vec, and the longest common subsequence approach is used to measure the semantic similarity between source and suspicious sentences. There are several types of plagiarism, such as verbatim, paraphrasing, source-code, and cross-lingual plagiarism. High obfuscation paraphrasing is one of the most difficult types to detect. The proposed method, which is based on POS tag n-grams, improves detection performance on high obfuscation paraphrasing, and this is the main contribution of this paper.
For this study, we use the large PAN-PC-11 dataset, which was created for the evaluation of automatic plagiarism detection algorithms. Our experiments are conducted with the four types of paraphrasing in PAN-PC-11: none, low, high, and simulated obfuscation. We defined various threshold and parameter settings in order to assess the diversity of our results. We compared the performance of our method with the plagiarism detectors in the 3rd International Competition on Plagiarism Detection (PAN11). According to the experimental results, the proposed method achieved the best performance in terms of the plagdet measure on high and low obfuscation paraphrasing and produced competitive results on the other paraphrasing types.

1. Introduction

In recent years, with the growth of data on the web, people can access information easily. Papers, assignments, or reports can be prepared in a very short time by simple copy and paste. Hence, it has become easier to create a new document on any subject by copying sections from various sources on the web (Sánchez-Vega, Villatoro-Tello, Montes-y-Gómez, Villaseñor-Pineda, & Rosso, 2013). This has led to duplicate or multiple documents with the same or similar content within a large database (Varol & Hari, 2015). The widespread practice of copying without citation has increased the number of plagiarism cases.

Plagiarism is defined as the act of taking ideas or expressions from the writings of others and presenting them as one's own¹. Plagiarism can take many different forms (Dhir, Arora, & Arora, 2008; Martin, 1994; Gupta & Banda, 2012; Bin-Habtoor & Zaher, 2012; Gipp & Meuschke, 2011), such as the following:
Copying the whole or some part of a document directly without citing references,
Changing the linguistic structure of a document describing someone else's ideas,

* Corresponding author.
E-mail addresses: kyalcin@cs.hacettepe.edu.tr (K. Yalcin), ilyas@cs.hacettepe.edu.tr (I. Cicekli), gonenc@cs.hacettepe.edu.tr (G. Ercan).
¹ http://www.oxforddictionaries.com/definition/english/plagiarism

Journal homepage: www.elsevier.com/locate/eswa
https://doi.org/10.1016/j.eswa.2022.116677
Received 30 September 2020; Received in revised form 18 January 2022; Accepted 11 February 2022