Performance of ontology-based semantic similarities in clustering

Montserrat Batet 1, Aida Valls 1, Karina Gibert 2

1 Department of Computer Science and Mathematics, Universitat Rovira i Virgili. Avda. Països Catalans, 26. 43007 Tarragona, Spain
{montserrat.batet,aida.valls}@urv.cat
2 Department of Statistics and Operations Research, Universitat Politècnica de Catalunya, Campus Nord, Ed.C5, c/ Jordi Girona 1-3, E-08034 Barcelona, Spain
karina.gibert@upc.edu

Abstract. Traditionally, clustering has been applied to numerical and categorical information. However, textual information is gaining importance with the appearance of methods for textual data mining. This paper proposes the use of classical clustering algorithms with a mixed function that combines numerical, categorical and semantic features. The content of the semantic features is extracted from textual data. Since semantic features must be compared using a semantic similarity function, and several such measures have been developed, this paper analyses and compares the behavior of some of them using WordNet as the background ontology. The different partitions obtained are compared to human classifications in order to see which one best approximates human reasoning. Moreover, the interpretability of the clusters obtained is discussed. The results show that the similarity measures that perform better on a standard benchmark also provide better and more interpretable partitions.

Keywords: Clustering, semantic similarity, ontologies.

1 Introduction

Clustering plays an important role in data mining. It is widely used for partitioning data into a certain number of homogeneous groups or clusters [14]. Traditionally, this data mining technique has been applied to numerical and categorical values. However, textual data mining now deserves increasing attention in order to exploit the information available in electronic texts or on the Web [4].
New variables, denoted as semantic features, can be used to describe the objects. The value of a semantic feature is a linguistic term that can be semantically interpreted using some additional knowledge, such as an ontology. Developing clustering techniques for heterogeneous databases that include numerical and categorical features together with features representing conceptual descriptions of the objects is a new field of study. To this end, we have designed and implemented a clustering method that can deal with numerical, categorical and semantic features to generate a hierarchical classification of a set of objects. The method calculates the contribution of each type of feature independently, according to its type of value, and the partial similarities are then aggregated into a single value using a mixing function.

In a previous paper, we compared the interpretability and quality of the clusters obtained when the semantic features were treated as categorical ones (where each word was treated as a simple modality) with those obtained when their meaning was appropriately considered using a semantic similarity measure. The conclusion was that considering the semantics of the concepts in those features improves the quality of the clustering, because the clusters have a clearer conceptual interpretation [2].

However, in the area of computational linguistics, many approaches have been proposed to compute semantic similarity. Usually, this similarity computation is based on the estimation of the semantic evidence observed in some additional knowledge representation model, in general an ontology [13,6]. The performance of these semantic similarity proposals has been evaluated in different studies [11, 7, 5] by comparing human evaluations of the similarity with the computerized results for a given set of word pairs [9].
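As an illustration of this kind of benchmark evaluation, the following minimal sketch ranks similarity measures by how well their scores agree with human ratings over a set of word pairs. It rests on several assumptions that are not taken from the cited studies: agreement is quantified with the Pearson correlation coefficient (one common choice), the names `pearson`, `rank_measures`, `measure_A` and `measure_B` are hypothetical, and all ratings and scores are invented placeholder values rather than benchmark data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (>= 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def rank_measures(human, measures):
    """Sort measures by their correlation with the human ratings, best first.

    human:    dict mapping a word pair to a human similarity rating
    measures: dict mapping a measure name to a dict of pair -> score
    """
    pairs = sorted(human)  # fixed pair order shared by all score lists
    gold = [human[p] for p in pairs]
    scored = {
        name: pearson(gold, [values[p] for p in pairs])
        for name, values in measures.items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Invented placeholder data (NOT taken from any published benchmark).
HUMAN = {("w1", "w2"): 3.8, ("w3", "w4"): 2.5, ("w5", "w6"): 1.0, ("w7", "w8"): 0.2}
MEASURES = {
    "measure_A": {("w1", "w2"): 0.9, ("w3", "w4"): 0.6, ("w5", "w6"): 0.3, ("w7", "w8"): 0.1},
    "measure_B": {("w1", "w2"): 0.5, ("w3", "w4"): 0.7, ("w5", "w6"): 0.2, ("w7", "w8"): 0.4},
}

if __name__ == "__main__":
    for name, r in rank_measures(HUMAN, MEASURES):
        print(f"{name}: r = {r:.3f}")
```

In this sketch the measure whose scores follow the human ratings more closely is listed first. A rank-based coefficient could be substituted for `pearson` if only the ordering of the pairs is of interest.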
Our hypothesis is that the semantic similarity measures that provide the best results when comparing pairs of terms in a standard benchmark will also provide more accurate clusters when they are used to compute similarities inside a clustering method. In this paper, we study the performance of ontology-based similarity measures when they are used in the classical Ward's clustering algorithm [12]. In our experiments, the ontology for assessing the