A New Context-Aware Measure for Semantic
Distance Using a Taxonomy and a Text Corpus
Ahmad El Sayed, Hakim Hacid, Djamel Zighed
ERIC laboratory – University of Lyon 2
5 avenue Pierre Mendès-France – 69676 Bron, France
Email: {asayed, hhacid, dzighed}@eric.univ-lyon2.fr
Abstract— Having a reliable semantic similarity measure between words/concepts can have a major effect in many fields such as Information Retrieval and Information Integration. A major shortcoming of existing semantic similarity measures is that none takes into account the actual context or the considered domain. However, two concepts similar in one context may appear completely unrelated in another context. In this paper, we present a new context-based semantic distance, and we propose to combine it with classical approaches that deal with taxonomies and corpora. With a correlation of 0.89 with human judgments on a set of word pairs, our approach outperforms all the other approaches considered.
I. INTRODUCTION
Information proliferation is the logical result of advances
in communication, information technologies, and information
processing fields. Indeed, with the Web and the high storage
and processing capabilities of modern tools, which increase
every day, users exchange huge volumes of data. In addition
to that, data are often stored in heterogeneous and independent
sources that can be structured databases as well as unstructured
information such as text files, HTML pages, etc. Recently, information systems have shown strong interest in the data integration field, since it allows them to retrieve information through a uniform interface for all data sources instead of using a different interface for each one. Most recent works in information retrieval and data integration have focused on using ontologies as a tool for representing "local" data sources, and then studying similarities between objects, across the same or different sources, in order to achieve one "global" representation of the whole information. However, as integration has to be meaningful, objects have to be "semantically" analyzed and compared.
Comparing two objects relevantly is still one of the biggest
challenges and it now concerns a wide variety of areas in
computer science, artificial intelligence and cognitive science.
The end goal is for our computational models to achieve a degree of "intelligence" that makes their judgments comparable to human intuitions about objects. This is obviously a hard task, especially since two objects sharing no attribute in common may still be related by some abstract, human-made relation.
The simplest illustration of such problem is certainly text
data, which concerns us in this paper. Indeed, two lexicographically similar words can mean two different things; conversely, two words with different forms can have the same meaning. Polysemy and synonymy have long been studied by the computational linguistics community. While the synonymy problem is often resolved by a simple dictionary or taxonomy, the polysemy problem requires more sophisticated word sense disambiguation techniques.
Beyond managing synonymy and polysemy, many applications need to measure the degree of semantic similarity between two words/concepts [1]; among them information retrieval, question answering, automatic text summarization, and translation. A major shortcoming of existing semantic similarity methods is that none takes into account the context or the considered domain. However, two concepts similar in one context may appear completely unrelated in another. For example, while blood and heart seem to be very similar in a general context, they represent two widely separated concepts in a domain-specific context like medicine.
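To make this intuition concrete, the sketch below is a toy illustration (not the measure proposed in this paper): it computes the cosine similarity between co-occurrence vectors for blood and heart, using invented counts standing in for a general-domain corpus and a medical corpus. The point is only that the same word pair can score very differently depending on which corpus supplies the context.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse co-occurrence vectors (dicts)."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy co-occurrence counts (all numbers invented, for illustration only).
# In a general corpus, "blood" and "heart" share many context words...
general = {
    "blood": {"body": 8, "pump": 5, "life": 4},
    "heart": {"body": 7, "pump": 6, "love": 3},
}
# ...while in a medical corpus their contexts diverge.
medical = {
    "blood": {"plasma": 9, "transfusion": 7, "cell": 5},
    "heart": {"valve": 8, "cardiac": 9, "surgery": 4},
}

sim_general = cosine(general["blood"], general["heart"])
sim_medical = cosine(medical["blood"], medical["heart"])
print(sim_general > sim_medical)  # prints True: context changes the score
```

With these toy counts the pair is highly similar in the general setting and scores zero in the medical one, since the two words share no context words there.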
Thus, our first-level approach is context-dependent. We
present a new method that computes semantic similarity in
taxonomies by considering the context pattern of the text corpus. Taxonomies and corpora are, in fact, complementary resources to exploit: each has its strengths and weaknesses, but using both simultaneously can provide semantic similarities that view words from multiple angles. We therefore combine both methods in our second-level, multisource approach to improve performance.
The rest of this paper is organized as follows: Section 2 briefly reviews existing semantic similarity measures. Our contribution, a context-dependent similarity measure, is described in Section 3. Section 4 presents the experiments conducted to evaluate and validate the proposed approach. We conclude and outline future work in Section 5.
II. SEMANTIC SIMILARITY IN TEXT
We can distinguish two categories of semantic similarity measures related to text data [2]: knowledge-based measures and corpus-based measures.
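As a minimal illustration of the knowledge-based family (again, not the method proposed in this paper), the sketch below implements an edge-counting measure in the spirit of Rada et al., where similarity decreases with the length of the shortest is-a path between two concepts. The taxonomy fragment is a hypothetical toy example.

```python
from collections import deque

# Toy is-a taxonomy (hypothetical fragment; edges link child to parent).
parents = {
    "car": ["vehicle"],
    "bicycle": ["vehicle"],
    "vehicle": ["artifact"],
    "hammer": ["tool"],
    "tool": ["artifact"],
    "artifact": [],
}

def shortest_path(a, b):
    """BFS over the undirected is-a graph; returns the edge count between concepts."""
    adj = {n: set(ps) for n, ps in parents.items()}
    for child, ps in parents.items():
        for p in ps:
            adj.setdefault(p, set()).add(child)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, d = queue.popleft()
        if node == b:
            return d
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None  # concepts not connected in the taxonomy

def path_similarity(a, b):
    """Edge-counting measure: similarity decreases with path length."""
    d = shortest_path(a, b)
    return None if d is None else 1.0 / (1.0 + d)

print(path_similarity("car", "bicycle"))  # 2 edges via "vehicle" -> 1/(1+2)
print(path_similarity("car", "hammer"))   # 4 edges via "artifact" -> 1/(1+4)
```

Siblings like car and bicycle, separated by two edges, score higher than car and hammer, which only meet at the more abstract artifact node.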
A. Knowledge-based Measures
Knowledge representation is usually the result of a collaborative human effort to represent generic or domain-specific
[1] In the following, 'words' is used when dealing with text corpora, and 'concepts' is used when dealing with taxonomies, where each concept contains a list of words carrying a certain sense.
[2] Due to length constraints, we do not detail the different formulas of the methods; readers are invited to follow the references for details.
279 1-4244-1500-4/07/$25.00 ©2007 IEEE