Embedding Term Similarity and Inverse Document Frequency into a Logical Model of Information Retrieval

David E. Losada
Intelligent System Group, Department of Electronics and Computer Science, University of Santiago de Compostela, Campus Sur, 15782 Santiago de Compostela, Spain. E-mail: dlosada@usc.es

Alvaro Barreiro
AILab, Department of Computer Science, University of A Coruña, Campus de Elviña, 15071, A Coruña, Spain. E-mail: barreiro@dc.fi.udc.es

We propose a novel approach to incorporate term similarity and inverse document frequency into a logical model of information retrieval. The ability of the logic to handle expressive representations along with the use of such classical notions are promising characteristics for IR systems. The approach proposed here has been efficiently implemented and experiments against test collections are presented.

Introduction

Logical models of Information Retrieval (IR) represent documents and queries as logical formulas, and apply some form of inference provided by the logic to decide relevance. The simplest approach is to model the relevance test using the logical entailment d ⊨ q, where d and q are logical representations of a document and a query, respectively. Nevertheless, it is well known that this criterion is too strict because it cannot deal with partial relevance (van Rijsbergen, 1986).

We focus on a logical approach that estimates the uncertainty of d ⊨ q using measures of distance between logical interpretations defined within a Belief Revision (BR) process (Losada, 2001; Losada & Barreiro, 1999, 2001b). The test d ⊨ q simply checks whether or not the set of models of the document is a subset of the set of models of the query, leading to a binary relevance test. In Losada and Barreiro (1999), a more elaborate matching process was proposed. Basically, a measure of distance from each model of the document to the set of models of the query is defined.
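
As an illustration of the two matching criteria just described, the following Python sketch enumerates the models of small propositional formulas, applies the strict test d ⊨ q as a subset check, and computes the distance from each document model to the set of query models. It assumes that two interpretations are compared by counting the letters on which they differ, and it averages the per-model distances to obtain a document-query distance. The helper names (models, entails, distance_to_set, average_distance) are hypothetical, and the brute-force enumeration is exponential in the number of letters; this is a sketch for intuition, not the implementation evaluated in this article.

```python
from itertools import product

def models(formula, letters):
    """Return the set of interpretations over `letters` that satisfy `formula`.

    An interpretation is a frozenset holding the letters assigned true;
    `formula` is any predicate over such a frozenset.
    """
    found = set()
    for values in product([False, True], repeat=len(letters)):
        interp = frozenset(l for l, v in zip(letters, values) if v)
        if formula(interp):
            found.add(interp)
    return found

def entails(doc_models, query_models):
    """Strict test d |= q: every model of d is also a model of q."""
    return doc_models <= query_models

def distance_to_set(model, query_models):
    """Distance from one document model to the closest query model.

    Assumption: two interpretations are compared by counting the letters
    on which they differ (size of the symmetric difference).
    """
    return min(len(model ^ m) for m in query_models)

def average_distance(doc_models, query_models):
    """Average, over the models of d, of the distance to the models of q.

    Both formulas are assumed satisfiable, so neither set is empty.
    """
    total = sum(distance_to_set(m, query_models) for m in doc_models)
    return total / len(doc_models)

letters = ["a", "b", "c"]
d = models(lambda i: "a" in i and "b" in i, letters)   # d = a AND b
q = models(lambda i: "a" in i and "c" in i, letters)   # q = a AND c

print(entails(d, q))           # False
print(average_distance(d, q))  # 0.5
```

In this toy case the strict test rejects the document, because the model {a, b} of d does not satisfy q, whereas the average distance of 0.5 still gives the document a position in a ranking.
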
Following this definition, the distance between a document and a query is measured as the average distance from models of the document to the set of models of the query. This distance is used to build a ranking of documents in terms of their distance from the query. This model is able to represent binary-weighted vectors and, in this case, the matching function corresponds to the inner product query-document matching function. Furthermore, the framework can handle representations that are more expressive than classical ones, and experiments conducted against classical collections (Losada & Barreiro, 2001a, 2001c) showed large improvements in retrieval performance when storing expressive formulas.

However, given two logical interpretations, their matching score is simply determined by the number of terms in common. This means that all the common propositional letters are considered equally good and, on the other hand, all the differing letters are considered equally bad. Because we are defining a model for IR, we can benefit from additional information that is peculiar to this application domain. In this work we propose an extension of the model proposed in Losada and Barreiro (1999) to cope with term similarity and inverse document frequency (idf). The logical framework remains the one suggested in Losada and Barreiro (1999), but the estimation of the extent to which d ⊨ q holds takes those notions into account.

Research in logical models of IR should give more weight to practical issues. In particular, we strongly believe that the statistical techniques used in classical IR should be merged with Knowledge Representation (KR) methods to build IR models able to capture the IR problem in a better way. There are two major challenges when merging IR and knowledge-based methods (Croft, 1993), namely: (a) to produce efficient implementations and (b) to apply the resulting models to general domains.
In this respect, we have taken great care of the computational complexity of the model proposed here and we provide an efficient implementation. Moreover, because the formalism is simple enough, we have articulated some methods to extract logical representations from classical IR test collections. This

© 2003 Wiley Periodicals, Inc. JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 54(4):285–301, 2003