A New Context-Aware Measure for Semantic Distance Using a Taxonomy and a Text Corpus

Ahmad El Sayed, Hakim Hacid, Djamel Zighed
ERIC laboratory – University of Lyon 2
5 avenue Pierre Mendès-France – 69676 Bron, France
Email: {asayed, hhacid, dzighed}@eric.univ-lyon2.fr

Abstract— Having a reliable semantic similarity measure between words/concepts can have a major impact in many fields, such as information retrieval and information integration. A major shortcoming of existing semantic similarity measures is that none of them takes the actual context, or the considered domain, into account. Yet two concepts that are similar in one context may appear completely unrelated in another. In this paper, we present a new context-based semantic distance and propose to combine it with classical approaches based on taxonomies and corpora. With a correlation of 0.89 with human judgments on a set of word pairs, our approach outperforms all the other approaches.

I. INTRODUCTION

Information proliferation is the logical result of advances in communication, information technology, and information processing. Indeed, with the Web and the ever-increasing storage and processing capabilities of modern tools, users exchange huge volumes of data. Moreover, data are often stored in heterogeneous and independent sources, which can be structured databases as well as unstructured information such as text files, HTML pages, etc. Recently, information systems have shown great interest in data integration, since it allows information to be retrieved through a single uniform interface for all data sources instead of a different interface for each one.
Most recent work in information retrieval and data integration has focused on using ontologies to represent "local" data sources, and then studying similarities between objects, within and across sources, in order to achieve one "global" representation of the whole information. However, for the integration to be meaningful, objects have to be "semantically" analyzed and compared. Comparing two objects in a relevant way is still one of the biggest challenges, and it now concerns a wide variety of areas in computer science, artificial intelligence, and cognitive science. The end goal is for our computational models to achieve a degree of "intelligence" that makes them comparable to human intuition about objects. This is obviously a hard task, especially since two objects sharing no attributes in common may still be related by some abstract, human-made relation.

The simplest illustration of this problem is certainly text data, which is our concern in this paper. Indeed, two lexicographically similar words can mean two different things; conversely, two words with different forms can have the same meaning. Polysemy and synonymy have long been studied by the computational linguistics community. While the synonymy problem is often resolved with a simple dictionary or taxonomy, the polysemy problem requires more sophisticated word sense disambiguation techniques.

Beyond managing synonymy and polysemy, many applications need to measure the degree of semantic similarity between two words/concepts [1]; among them: information retrieval, question answering, automatic text summarization and translation, etc. A major shortcoming of existing semantic similarity methods is that none of them takes the context or the considered domain into account. Yet two concepts that are similar in one context may appear completely unrelated in another.
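To make the notion of corpus-derived word similarity concrete, the sketch below builds co-occurrence vectors from a toy corpus and compares them with the cosine measure. This is a generic distributional illustration under our own assumptions (the toy sentences, the window size, and the helper names are all hypothetical), not the measure proposed in this paper.

```python
# A minimal distributional-similarity sketch (illustrative only):
# words appearing in similar contexts get similar co-occurrence
# vectors, which are compared with the cosine measure.
from collections import Counter
from math import sqrt

toy_corpus = [
    "the heart pumps blood through the body",
    "blood flows from the heart through vessels",
    "the car engine pumps fuel",
]

def context_vector(word, sentences, window=2):
    """Count the words co-occurring with `word` within +/- `window` tokens."""
    counts = Counter()
    for sent in sentences:
        tokens = sent.split()
        for i, tok in enumerate(tokens):
            if tok == word:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(t for t in tokens[lo:hi] if t != word)
    return counts

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

heart = context_vector("heart", toy_corpus)
print(cosine(heart, context_vector("blood", toy_corpus)))   # shared contexts
print(cosine(heart, context_vector("engine", toy_corpus)))  # fewer shared contexts
```

Note that the result depends entirely on the corpus fed in: on a medical corpus the very same procedure would yield different vectors, which is precisely the context sensitivity this paper sets out to exploit.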
As a simple example: while blood and heart seem to be very similar in a general context, they represent two widely separated concepts in a domain-specific context like medicine. Thus, our first-level approach is context-dependent: we present a new method that computes semantic similarity in taxonomies by taking into account the context patterns of the text corpus. Indeed, taxonomies and corpora are both interesting resources to exploit. We believe that each has its strengths and weaknesses, but that using them simultaneously can provide semantic similarities informed by multiple views on words, from different angles. We therefore propose to combine both methods in our second-level, multisource approach to improve the expected performance.

The rest of this paper is organized as follows: Section 2 briefly reviews some semantic similarity measures. Our contribution, a context-dependent similarity measure, is described in Section 3. Section 4 presents the experiments conducted to evaluate and validate the proposed approach. We conclude and outline future work in Section 5.

II. SEMANTIC SIMILARITY IN TEXT

We can distinguish two categories of semantic similarity measures for text data [2]: knowledge-based measures and corpus-based measures.

A. Knowledge-based Measures

Knowledge representation is usually the result of a collaborative human effort to represent generic or domain-specific

[1] In the following, 'words' is used when dealing with text corpora and 'concepts' is used when dealing with taxonomies, where each concept contains a list of words carrying a certain sense.
[2] We will not detail the different formulas of the methods due to length constraints. Readers are invited to follow the references for details.

1-4244-1500-4/07/$25.00 ©2007 IEEE
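The knowledge-based family can be illustrated with a classic edge-counting sketch over a taxonomy: concepts that are close together in the hierarchy are deemed similar. The toy taxonomy and helper names below are hypothetical and the measure shown (inverse path length through the lowest common ancestor) is only one well-known representative of this family, not the method contributed by this paper.

```python
# A minimal knowledge-based similarity sketch (illustrative only):
# similarity = 1 / (1 + shortest path length in edges), where the
# path runs through the lowest common ancestor in a toy taxonomy.

# child -> parent links of a tiny hypothetical taxonomy
PARENT = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "mammal",
    "cat": "feline",
    "feline": "mammal",
    "mammal": "animal",
    "animal": None,  # root
}

def path_to_root(concept):
    """Return the chain of concepts from `concept` up to the root."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path

def path_similarity(c1, c2):
    """Inverse path-length similarity via the lowest common ancestor."""
    p1, p2 = path_to_root(c1), path_to_root(c2)
    depth1 = {c: d for d, c in enumerate(p1)}
    for d2, c in enumerate(p2):
        if c in depth1:  # first shared ancestor = lowest common ancestor
            return 1.0 / (1 + depth1[c] + d2)
    return 0.0  # no common ancestor

print(path_similarity("dog", "wolf"))  # siblings: 1 / (1 + 2)
print(path_similarity("dog", "cat"))   # farther apart: 1 / (1 + 4)
```

Such a measure uses only the taxonomy's structure, which is exactly the weakness the paper points to: the score is fixed regardless of the domain or corpus at hand.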