(Research Article)
Sentence Level Paraphrase Identification System for Tamil Language
Dr. S. V. Kogilavani 1*, Dr. C. S. Kanimozhiselvi 2, Dr. S. Malliga 3
1*, 2, 3 Department of Computer Science and Engineering, Kongu Engineering College, Erode, Tamil Nadu, INDIA

Abstract
Automatic paraphrase detection has many applications, such as plagiarism detection and new event detection. A paraphrase expresses a given fact in more than one way by means of different phrases. Paraphrase identification is a classical natural language processing task, posed as a classification problem. The aim of this work is to detect sentence-level plagiarism through paraphrase identification of sentences in Tamil. The Tamil sentences are processed using a Tamil shallow parser. Shallow parsing analyzes a sentence to identify the parts of speech of its words, such as nouns, verbs and adjectives. Sentences are also processed with the word2vec tool to capture word-order information between sentences. From the output of the shallow parser and word2vec, a feature file is constructed in which the text values are converted into a numerical matrix. This feature file is given as input to machine learning algorithms, which classify each sentence pair as paraphrase or not-a-paraphrase. If a pair is classified as a paraphrase, the sentence is considered plagiarized. The performance of these methods is measured using accuracy, precision, recall and F-measure. The analysis based on these measures shows that the Random Forest method classifies sentence pairs as paraphrase or not-a-paraphrase with higher accuracy than the other methods.

Keywords: machine learning algorithm, paraphrase identification, plagiarism detection, shallow parser

1. Introduction
Paraphrase identification is the process of determining whether two different texts have the same meaning.
Paraphrase identification plays an important role in information retrieval, information extraction, natural language processing and machine translation. To identify paraphrases, the similarity between the two sentences of a pair is calculated. Two sentences may have the same meaning yet express it with different wording; when two sentences are similar in meaning, their words may or may not be similar. Structural relations include the relations between words and the distances between words. The similarity between sentences is measured based on statistical information about the sentences. The statistical similarity between two sentences is calculated from word vectors using the cosine similarity measure. The semantic similarity between two sentences is calculated based on word order.

Plagiarism is defined as presenting another person's content as one's own work. Plagiarism is not in itself a crime, but it can constitute copyright infringement. In academia and industry, it is a serious ethical offense. Plagiarism and copyright infringement overlap to a considerable extent, but they are not equivalent concepts, and many types of plagiarism do not constitute copyright infringement.

Paraphrase identification is the task of determining whether two or more sentences convey the same meaning. Plagiarism detection relies on paraphrase identification to detect sentences that are paraphrases of others. If two sentences are paraphrases of each other, they are treated as plagiarized sentences. In this way, paraphrase identification makes it possible to detect plagiarism at the sentence level. To measure the similarity between two Tamil sentences, shallow parsing is first applied to identify the basic parts of each sentence. The Word2Vec tool is used to compute the cosine similarity measure, and the word order similarity is also calculated.
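The two similarity measures mentioned above can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: the small `embeddings` dictionary is a hypothetical stand-in for vectors produced by word2vec, sentence vectors are formed by simple averaging, and the word order score uses a normalized difference of position vectors, a formulation found in the sentence-similarity literature. The paper does not state which formulas it uses, so the function names and choices here are illustrative only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_vector(tokens, embeddings):
    """Average the vectors of the tokens found in the lookup table (assumed pooling choice)."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return []
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def word_order_similarity(tokens_a, tokens_b):
    """1 - ||r_a - r_b|| / ||r_a + r_b||, where r_x holds 1-based word positions.

    Simplification: a word missing from a sentence gets position 0.
    """
    joint = list(dict.fromkeys(tokens_a + tokens_b))  # unique words, order preserved

    def order_vector(tokens):
        first_pos = {}
        for i, tok in enumerate(tokens, start=1):
            first_pos.setdefault(tok, i)  # keep the first occurrence only
        return [first_pos.get(word, 0) for word in joint]

    r_a, r_b = order_vector(tokens_a), order_vector(tokens_b)
    diff = math.sqrt(sum((x - y) ** 2 for x, y in zip(r_a, r_b)))
    total = math.sqrt(sum((x + y) ** 2 for x, y in zip(r_a, r_b)))
    return 1.0 if total == 0.0 else 1.0 - diff / total


def pair_features(tokens_a, tokens_b, embeddings):
    """Hypothetical feature row for one sentence pair: [cosine, word order]."""
    vec_a = sentence_vector(tokens_a, embeddings)
    vec_b = sentence_vector(tokens_b, embeddings)
    cos = cosine_similarity(vec_a, vec_b) if vec_a and vec_b else 0.0
    return [cos, word_order_similarity(tokens_a, tokens_b)]


# Toy, made-up vectors standing in for word2vec output.
embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
features = pair_features(["a", "b", "c"], ["c", "b", "a"], embeddings)
```

In this toy pair both sentences contain the same words, so the averaged vectors coincide and the cosine score is close to 1, while the reversed order lowers the word order score. Rows like `features` are the kind of numerical values that could populate a feature file for a classifier.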
The feature file is constructed from all these values, and each sentence pair is classified as paraphrase or not-a-paraphrase using various classification algorithms.

INTERNATIONAL JOURNAL OF DARSHAN INSTITUTE ON ENGINEERING RESEARCH AND EMERGING TECHNOLOGIES, Vol. 7, No. 1, 2018, www.ijdieret.in (IJDI-ERET)
ISSN 2320-7590. 2018 Darshan Institute of Engg. & Tech., All rights reserved.
* Corresponding Author: e-mail: kogilavani.sv@gmail.com

2. Literature Review
Paolo Rosso [1] analyzed the relationship between paraphrasing and plagiarism, paying special attention to which paraphrase phenomena underlie acts of plagiarism and