Sci. Int. (Lahore), 30(6), 907-914, November-December 2018. ISSN 1013-5316; CODEN: SINTE 8

SEMANTIC SIMILARITY MEASURES BETWEEN WORDS: A BRIEF SURVEY

Ashraf Ali 1, 2 *, Fayez Alfayez 1 and Hani Alquhayz 1
1 Department of Computer Science and Information, College of Science Al Zulfi 11932, P.O 1221, Majmaah University, Kingdom of Saudi Arabia
2 International Center for Advanced Interdisciplinary Research (ICAIR), G-12/316, Ratiya Marg, Sangam Vihar, New Delhi, India-110062
*Correspondence: a.haider@mu.edu.sa

ABSTRACT: Semantic similarity measurement is the task of determining the similarity between terms such as words, sentences, documents, concepts or instances. The aim of measuring the semantic similarity between two sets of words is to find their degree of relevance by matching words that are conceptually similar but not necessarily lexicographically similar. Semantic similarity measures are of great importance in many computing fields such as information retrieval, educational systems, text summarization and natural language processing (NLP). Computing the semantic similarity between words poses several challenges, such as the complexity of natural languages and the ambiguity of words. A major challenge is that words may be similar in meaning without being lexicographically similar. Traditional approaches to computing semantic similarity are a major obstacle, as they are inappropriate in many circumstances: many existing approaches fail to deal with terms not covered by synonym dictionaries and cannot handle abbreviations, acronyms, brand names and so on. To overcome these problems, we present and evaluate several promising methodologies that utilize various kinds of search-engine-based intelligence to determine the degree of similarity between words.
These methodologies draw on an assortment of paradigms including text snippet comparison, frequent pattern finding, co-occurrence measures, trend analysis, and so on. The key objective is to replace the traditional methodologies where necessary.

Keywords: Semantic Similarity Measure; NLP; Web Search Engine; Ontologies

1. INTRODUCTION
Computing the semantic similarity between words, terms, sentences, texts or statements that are the same in meaning but not lexicographically similar is a critical task with major impact on many textual applications [9, 14]. In information retrieval, a similarity measure is used to assign a ranking score between a query and the texts in the corpus. Question-answering applications require similarity identification between a question and its candidate answers. Several types of ontologies are used for computing semantic similarity, such as WordNet [9, 11], SENSUS [17], Cyc [27], UMLS [22] and MeSH [24]. The diversity of natural language expressions makes it very difficult to identify semantically equivalent terms. While many applications employ similarity functions to compute the semantic similarity between terms, most traditional approaches solve the problem using manually compiled dictionaries such as WordNet [6]. The main problem is that many terms (e.g. abbreviations, acronyms, brand names, buzzwords) are not covered by these kinds of dictionaries. As a result, semantic similarity measures based on such resources cannot be applied directly in these cases. On the other hand, Web Search Engine (WSE) based approaches use a form of collective intelligence that shows promise for solving a number of these problems. We would like to exploit WSE collective intelligence to solve problems related to semantic similarity.
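As a concrete illustration of the page-count-based co-occurrence measures mentioned above, the following minimal sketch computes the WebJaccard coefficient from search-engine hit counts. The page counts used here are hypothetical, not fetched from a live search engine, and the threshold value c = 5 is an illustrative assumption.

```python
def web_jaccard(n_p: int, n_q: int, n_pq: int, c: int = 5) -> float:
    """WebJaccard co-occurrence score from page counts.

    n_p, n_q: page counts for the individual queries "P" and "Q";
    n_pq:     page count for the conjunctive query "P AND Q";
    c:        threshold filtering out accidental co-occurrences.
    """
    if n_pq <= c:
        return 0.0
    return n_pq / (n_p + n_q - n_pq)

# Hypothetical page counts (illustrative values only):
score = web_jaccard(1_000_000, 800_000, 400_000)
print(round(score, 4))  # → 0.2857
```

In practice the three counts would be obtained from a WSE's hit-count estimates for the two terms and their conjunction; the threshold c guards against spurious similarity for rare, accidental co-occurrences.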
To perform our experiments, we utilize approaches based on WSEs (e.g. Google, Bing, Yandex, Ask). This paper investigates and evaluates several promising semantic similarity approaches for determining the degree of relevance between words using WSE-based collective intelligence. We mainly focus on methods that can intelligently measure the similarity between emerging terms not frequently covered in dictionaries, such as the method based on historical search patterns from WSEs [15].

The remainder of this paper is organized as follows: Section 2 reviews the various ontologies used for semantic similarity. Section 3 describes related work. Section 4 describes the WSE-based approaches for semantic similarity measurement, including snippet comparison, page-count-based co-occurrence measures, frequent pattern finding, and trend analysis. Section 5 presents a statistical evaluation of the presented methods on a benchmark dataset. Finally, we conclude the paper and outline directions for future research in Section 6.

2. TYPES OF ONTOLOGIES USED FOR THE SEMANTIC SIMILARITY MEASURES
Over the years, several types of ontologies have become available and have been utilized for computing the semantic similarity between words, including general-purpose ontologies such as WordNet [9, 11], SENSUS [17] and Cyc [27], and domain-based ontologies such as UMLS [22], MeSH [24] and STDS [1].

2.1 General-purpose Ontologies
General-purpose ontologies are structured networks of concepts interconnected by different types of assumptions and semantic relations from multiple knowledge domains. These ontologies are developed to provide explicit specifications of general-purpose domains in a machine-readable and understandable format.

2.1.1.
WordNet
WordNet [11] is a knowledge base in the form of a lexical database that stores the meanings of words and the relationships between them in a conceptually organized hierarchy. It is an online database in which nouns, verbs, adjectives and adverbs are grouped into logical structures called synsets. A synset represents a group of synonymous words that express one underlying concept. WordNet can be seen as an ontology for natural language terms and can be applied to compute semantic similarity scores. The latest version of WordNet is 3.1, announced in November 2012, which contains 155,287 words organized in
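A synset hierarchy of this kind supports path-based similarity scoring: two concepts are more similar the shorter the path connecting them through their common hypernym. The following self-contained sketch mirrors WordNet-style path similarity over a toy hypernym taxonomy; the taxonomy itself is a drastically simplified assumption for illustration, not WordNet data.

```python
# Toy hypernym taxonomy (child -> parent); a simplified stand-in for
# WordNet's noun hierarchy, used only to illustrate the idea.
HYPERNYMS = {
    'car': 'motor_vehicle',
    'truck': 'motor_vehicle',
    'motor_vehicle': 'vehicle',
    'bicycle': 'vehicle',
    'vehicle': 'artifact',
    'gem': 'artifact',
}

def path_to_root(concept: str) -> list:
    """Chain of concepts from `concept` up to the taxonomy root."""
    path = [concept]
    while path[-1] in HYPERNYMS:
        path.append(HYPERNYMS[path[-1]])
    return path

def path_similarity(a: str, b: str) -> float:
    """1 / (edge distance between a and b via their lowest common
    ancestor + 1), mirroring WordNet-style path similarity."""
    pa, pb = path_to_root(a), path_to_root(b)
    ancestors = set(pb)
    for i, node in enumerate(pa):
        if node in ancestors:
            j = pb.index(node)
            return 1.0 / (i + j + 1)
    return 0.0  # no common ancestor in this taxonomy

print(path_similarity('car', 'car'))    # identical concept → 1.0
print(path_similarity('car', 'truck'))  # siblings, 2 edges apart → 1/3
print(path_similarity('car', 'gem'))
```

Real implementations (e.g. NLTK's WordNet interface) apply the same principle over the full WordNet graph, where each node is a synset rather than a single word.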