(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 5, 2020

Measuring the Similarity between Sanskrit Documents using the Context of the Corpus

Jatinderkumar R. Saini 1, Prafulla B. Bafna 2
Symbiosis Institute of Computer Studies and Research, Symbiosis International Deemed University, Pune, India

Abstract—Identifying the similarity between two documents is a challenging but important task. It benefits various applications such as recommender systems, plagiarism detection and so on. One of the popular approaches for processing any text document is the document term matrix (DTM). The proposed approach processes Sanskrit, one of the oldest and most morphologically complex languages and a largely untouched one for text mining, and builds a document term matrix for Sanskrit (DTMS) and a document synset matrix for Sanskrit (DSMS). DTMS uses the frequency of each term, whereas DSMS uses the frequency of each synset instead of the term, which contributes to dimension reduction. The proposed approach considers the semantics and context of the corpus to solve the problem of polysemy. More than 760 documents, including Subhashitas and stories, are processed together. F1 score, precision, accuracy and the Matthews correlation coefficient (MCC), which is the most balanced of these measures, are used to demonstrate the improvement achieved by the proposed approach.

Keywords—Cosine; dimension reduction; Sanskrit; synset; Matthews correlation coefficient

I. INTRODUCTION

The degree of matching between two text pieces, based on both their statistics and their semantics, is termed the similarity between the text pieces [24]. The statistics of a document refers to properties such as its length, the tokens present in it, etc. The semantics of a document refers to the meaning of the words present in it. These documents/text pieces can be in the form of Word files, PDFs and so on. There are various measures to calculate the similarity between two documents.
These include Jaccard, cosine similarity and so on. Cosine similarity is independent of the statistics of the document. It calculates the cosine of the angle between two vectors, where each vector comprises the frequencies of words in a multi-dimensional space and each word present in the document represents a dimension/feature [10]. Thus cosine similarity captures the orientation of the text document rather than its magnitude only. Cosine similarity [11] is better than other similarity measures, e.g. Euclidean distance. For term-frequency vectors, the cosine value always lies between '0' and '1', and a value near '1' indicates that the documents are highly similar. Calculating cosine similarity between English, Hindi [12-14] and Marathi [15] text documents [5] is a common task, but processing the Sanskrit language [33,30,28] and its morphological analysis [35] are critical tasks; as a result, finding the mapping between Sanskrit texts is challenging.

Sanskrit is often regarded as the mother of many languages. Panini formalized this grammar-rich language more than 2500 years ago. Sanskrit has been the traditional means of communication in Hinduism, Jainism, Buddhism and Sikhism; still, Sanskrit text mining is an untouched area. Several kinds of literature are available in Sanskrit, e.g. stories, subhashitas and so on. A subhashita (Sanskrit: 'सुभाषित') is a genre of concise Sanskrit poems that communicate messages of advice, aphorisms and so on. Generally, Sanskrit subhashitas and stories relate to all aspects of life. Subhashitas are significant in Indian traditional education and are used to teach values like truthfulness, courage and righteousness, which are applicable to every phase of life.

To extract information from Sanskrit text, various techniques are used. DTMS is one such technique, using which different operations can be carried out on a Sanskrit corpus. Sanskrit documents are placed in rows and significant terms are placed in columns.
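As a minimal sketch of the pipeline described above, the following Python fragment builds a term-frequency matrix (documents in rows, terms in columns) and compares two rows with cosine similarity. The tiny transliterated corpus and the whitespace tokenization are illustrative assumptions, not the paper's actual data or preprocessing:

```python
import math
from collections import Counter

def build_dtm(docs):
    """Build a document term matrix: one term-frequency vector per
    document over the shared vocabulary of the corpus."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    return vocab, [[Counter(doc.split())[t] for t in vocab] for doc in docs]

def cosine(u, v):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Tiny illustrative corpus (placeholder sentences, not real data).
docs = ["rama vana gacchati", "rama vana agacchati", "sita gita pathati"]
vocab, dtm = build_dtm(docs)
print(cosine(dtm[0], dtm[1]))  # high: two of three terms shared
print(cosine(dtm[0], dtm[2]))  # 0.0: no terms in common
```

Because the frequencies are non-negative, the resulting cosine values always fall in the [0, 1] range noted above.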
The entry in the matrix represents the number of times the particular Sanskrit term occurs in the document. The significance of a term is decided based on its frequency. The semantics of the terms is considered in DSMS. DSMS uses synset groups, in which semantically similar tokens are grouped together; instead of term frequency, the frequency of the synset group is considered. This helps to solve the polysemy problem, i.e. one word being used with different senses.

Dimension reduction means the removal of unnecessary features. Several methods are available for dimension reduction, like principal component analysis, latent semantic analysis, etc. In text processing [1][2], different NLP tasks [5-9] are carried out; for example, removal of stop words [3][4][32][29] results in dimension reduction. Before stop-word removal, tokenization needs to be carried out. For example, in the Sanskrit statement 'ततो मषिका उषिय गता.', meaning 'The fly flew away', the token 'ततो', meaning 'from there', is removed after separation of tokens. The tokens of the sentence are 'ततो', 'मषिका', 'उषिय', 'गता', '.'. Lemmatization converts words into their meaningful root form [31]. On the formulated document synset matrix, several applications could be built, like plagiarism detection, document clustering, etc. Till now, no research has been carried out to find Sanskrit document similarity using semantics and context.

To evaluate machine learning algorithms, different parameters are available, e.g. precision, accuracy and the Matthews correlation coefficient. The Matthews correlation coefficient (MCC) is a quality measure for binary classification. It is a
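For reference, the MCC mentioned above can be sketched from the four confusion-matrix counts using its standard definition; the counts below are hypothetical and purely illustrative, not results from this paper:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for binary classification,
    computed from confusion-matrix counts. It ranges from -1 (total
    disagreement) to +1 (perfect prediction), with 0 meaning chance."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical counts for illustration only.
print(mcc(tp=40, tn=45, fp=5, fn=10))  # roughly 0.70
```

Unlike accuracy, MCC uses all four counts, which is why it stays balanced even when the two classes are of very different sizes.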