A DISTANCE FUNCTION TO ASSESS THE SIMILARITY OF WORDS USING ONTOLOGIES

Montserrat Batet¹, Aida Valls¹, Karina Gibert²

¹ Universitat Rovira i Virgili, Department of Computer Science and Mathematics
Av. Països Catalans 26, 43007 Tarragona, Spain
{montserrat.batet,aida.valls}@urv.cat

² Universitat Politècnica de Catalunya, Department of Statistics and Operations Research
Campus Nord, Ed.C5, c/Jordi Girona 1-3, 08034 Barcelona, Spain
karina.gibert@upc.edu

Abstract

When comparing categorical values, traditional approaches use metrics based only on the matching of the values, obtaining a Boolean result. This paper proposes a measure that computes the degree of semantic similarity between a pair of terms using an ontology as background knowledge. The presented measure, the Superconcept-based distance, has two main advantages over other approaches based on the exploitation of the hierarchical model of ontologies: on the one hand, it takes into account the whole hierarchy of concepts in the ontology to assess the similarity between a pair of words; on the other hand, this paper proves that the measure fulfills the distance properties. As the paper also reviews, the semantic similarity measures commonly used in the literature do not fulfill the triangle inequality, which prevents them from being used in some decision making methods.

Keywords: Semantic similarity, ontologies, linguistic terms, distance properties.

1 INTRODUCTION

Our work focuses on decision making problems that deal with variables whose values are taken from a list of linguistic terms. Unlike categorical variables, which have a predefined domain of terms (i.e. modalities), we face the case of a non-fixed and large set of possible values. Moreover, no ordering or measurement scale is defined on the values, in contrast to traditional linguistic variables. Examples of this type of variable are languages or hobbies.
Traditionally, the comparison between two values of a categorical variable is based simply on the equality or inequality of the words, due to the lack of proper methods for representing the meaning of the terms. Some widely used distance measures for categorical values are the Chi-Squared and the Hamming distance [16]. However, from a semantic point of view, it is possible to establish different degrees of similarity between values (e.g. Italian is more similar to French than to Chinese). Each of these terms in fact describes a concept; thus, reasoning at a conceptual level is needed in order to calculate an approximation of the similarity.

The computation of the semantic similarity between concepts is an active trend in computational linguistics. The similarity between a pair of concepts quantifies how alike they are, based on the estimation of semantic evidence observed in some knowledge source. Taxonomies and, more generally, ontologies [13] can be seen as graph models in which semantic relations are modeled as links between concepts. Several semantic similarity measures have been proposed in the literature. A brief review of those measures is presented in Section 2. In this paper we focus on measures based on the exploitation of the taxonomic relations in ontologies. These measures are based on the computation of the minimum path length between a pair of concepts to assess the semantic similarity. Consequently, a lot of relevant information is missed, because the rest of the taxonomic information is not considered.

To address this problem, this paper presents a new measure, which takes into account the common and non-common ancestors (i.e. superconcepts) of the two concepts compared. It is called Superconcept-based Distance (SCD).
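To make the contrast between these three views concrete, the following minimal Python sketch compares a Boolean matching distance, the minimum path length, and the counts of shared and non-shared superconcepts. The taxonomy, the function names and the choice to count a concept among its own superconcepts are assumptions made only for illustration; the sketch does not reproduce the formal definition of SCD.

```python
from collections import deque

# Hypothetical toy taxonomy (concept -> direct superconcepts).
# It is invented for illustration and is not taken from the paper.
PARENTS = {
    "Language": [],
    "IndoEuropean": ["Language"],
    "SinoTibetan": ["Language"],
    "Romance": ["IndoEuropean"],
    "Germanic": ["IndoEuropean"],
    "Italian": ["Romance"],
    "French": ["Romance"],
    "English": ["Germanic"],
    "Chinese": ["SinoTibetan"],
}


def boolean_distance(a, b):
    """Matching-based comparison: 0 if the terms are equal, 1 otherwise."""
    return 0 if a == b else 1


def superconcepts(concept):
    """Return the concept together with all of its ancestors.

    Assumption: the concept itself is counted among its superconcepts.
    """
    seen = {concept}
    stack = [concept]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def path_length(a, b):
    """Minimum number of taxonomic links between a and b (breadth-first search)."""
    # Treat every is-a link as an undirected edge.
    neighbours = {c: set(ps) for c, ps in PARENTS.items()}
    for child, ps in PARENTS.items():
        for p in ps:
            neighbours[p].add(child)
    queue = deque([(a, 0)])
    visited = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours[node] - visited:
            visited.add(nxt)
            queue.append((nxt, dist + 1))
    raise ValueError("concepts are not connected")


def ancestor_overlap(a, b):
    """Return (number of shared superconcepts, number of non-shared ones)."""
    sa, sb = superconcepts(a), superconcepts(b)
    return len(sa & sb), len(sa ^ sb)


for other in ("French", "English", "Chinese"):
    print(other,
          boolean_distance("Italian", other),
          path_length("Italian", other),
          ancestor_overlap("Italian", other))
```

In this toy hierarchy the Boolean comparison returns 1 for all three pairs, whereas the path length and the superconcept counts distinguish French (closest to Italian) from English and Chinese. The superconcept counts use every ancestor of both terms, not only those on the shortest path.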
ESTYLF 2010, Huelva, February 3-5, 2010. XV Congreso Español Sobre Tecnologías y Lógica Fuzzy, p. 561

When this comparison between terms is done in the context of decision making, it is worth knowing the metric properties of the measure, because they may affect the results obtained. For example, in the particular case of hierarchical clustering, it is interesting to maintain