Embedding Term Similarity and Inverse Document
Frequency into a Logical Model of Information Retrieval
David E. Losada
Intelligent System Group, Department of Electronics and Computer Science, University of Santiago de
Compostela, Campus Sur, 15782 Santiago de Compostela, Spain. E-mail: dlosada@usc.es
Alvaro Barreiro
AILab, Department of Computer Science, University of A Coruña, Campus de Elviña, 15071, A Coruña, Spain.
E-mail: barreiro@dc.fi.udc.es
We propose a novel approach to incorporate term sim-
ilarity and inverse document frequency into a logical
model of information retrieval. The ability of the logic to
handle expressive representations along with the use of
such classical notions are promising characteristics for
IR systems. The approach proposed here has been effi-
ciently implemented and experiments against test col-
lections are presented.
Introduction
Logical models of Information Retrieval (IR) represent
documents and queries as logical formulas, and apply some
form of inference provided by the logic to decide relevance.
The simplest approach is to model the relevance test using
the logical entailment d ⊨ q, where d and q are logical
representations of a document and a query, respectively.
Nevertheless, it is well known that this criterion is too strict
because it cannot deal with partial relevance (van Rijsber-
gen, 1986). We focus on a logical approach that estimates
the uncertainty of d ⊨ q using measures of distance
between logical interpretations defined within a Belief Re-
vision (BR) process (Losada, 2001; Losada & Barreiro,
1999, 2001b). The test d ⊨ q simply checks whether or not
the set of models of the document is a subset of the set of
models of the query, leading to a binary relevance test. In
Losada and Barreiro (1999), a more elaborate matching
process was proposed. Basically, a measure of distance
from each model of the document to the set of models of the
query is defined. Following this definition, the distance between
a document and a query is measured as the average
distance from models of the document to the set of models
of the query. This distance is used to build a ranking of
documents in terms of their distance from the query. This
model is able to represent binary-weighted vectors and, in
this case, the matching function corresponds to the inner
product query-document matching function. Furthermore,
the framework can handle representations that are more
expressive than classical ones and experiments conducted
against classical collections (Losada & Barreiro, 2001a,
2001c) showed large improvements in retrieval perfor-
mance when storing expressive formulas.
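
The contrast between the strict entailment test and the distance-based ranking can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the distance between two interpretations is the number of propositional letters on which they differ (a Hamming-style count), and the names `models`, `entails`, `hamming`, and `distance`, as well as the three-letter alphabet, are hypothetical choices made only for this example. Models are enumerated by brute force, which is feasible only for tiny alphabets.

```python
from itertools import product

def models(formula, letters):
    """Return every truth assignment over `letters` that satisfies `formula`.

    `formula` is a hypothetical callable taking a dict {letter: bool}.
    Each model is a tuple of booleans in the order given by `letters`.
    """
    found = []
    for values in product([False, True], repeat=len(letters)):
        if formula(dict(zip(letters, values))):
            found.append(values)
    return found

def entails(d_models, q_models):
    """Strict test: every model of the document is also a model of the query."""
    return set(d_models) <= set(q_models)

def hamming(m1, m2):
    """Assumed model-to-model distance: number of letters with differing values."""
    return sum(a != b for a, b in zip(m1, m2))

def distance(d_models, q_models):
    """Average, over document models, of the distance to the closest query model."""
    if not d_models or not q_models:
        raise ValueError("document and query must both be satisfiable")
    closest = [min(hamming(m, n) for n in q_models) for m in d_models]
    return sum(closest) / len(closest)

# Illustrative alphabet and formulas (assumptions, not taken from the paper).
letters = ["a", "b", "c"]
query = models(lambda v: v["a"] and v["c"], letters)                      # a AND c
exact_doc = models(lambda v: v["a"] and v["b"] and v["c"], letters)       # a AND b AND c
partial_doc = models(lambda v: v["a"] and v["b"], letters)                # a AND b, c unknown
far_doc = models(lambda v: not v["a"] and v["b"] and not v["c"], letters) # NOT a AND b AND NOT c

# Rank documents by increasing distance from the query.
docs = {"exact": exact_doc, "partial": partial_doc, "far": far_doc}
ranking = sorted(docs, key=lambda name: distance(docs[name], query))
```

Under these assumptions, only `exact_doc` passes the binary entailment test, whereas the distance still separates `partial_doc` (which leaves c unspecified) from `far_doc` (which contradicts the query), and that separation is what makes a ranking possible.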
However, given two logical interpretations, their match-
ing score is simply determined by the number of terms in
common. This means that all shared propositional letters are
considered equally good and, conversely, all differing
letters are considered equally bad. Because we are defining
a model for IR, we could benefit from additional
information that is peculiar to this application domain. In
this work we propose an extension of the model proposed in
Losada and Barreiro (1999) to cope with term similarity and
inverse document frequency (idf). The logical framework
remains the same as the one suggested in Losada and Barreiro
(1999), but the estimate of the degree to which d ⊨ q holds takes
these notions into account.
Research in logical models of IR should give more
weight to practical issues. In particular, we strongly believe
that the statistical techniques used in classical IR should be
merged with Knowledge Representation (KR) methods to
build IR models able to capture the IR problem in a better
way. There are two major challenges when merging IR and
knowledge-based methods (Croft, 1993), namely: (a) to
produce efficient implementations and (b) to apply the resulting
models to general domains. In this respect, we
have paid close attention to the computational complexity of
the model proposed here, and we provide an efficient implementation.
Moreover, because the formalism is simple
enough, we have articulated some methods to extract logical
representations from classical IR test collections. This
© 2003 Wiley Periodicals, Inc.
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 54(4):285–301, 2003