Word Vector Compositionality based Relevance Feedback using Kernel Density Estimation

Dwaipayan Roy (dwaipayan_r@isical.ac.in), Mandar Mitra (mandar@isical.ac.in), CVPR Unit, Indian Statistical Institute, Kolkata, India
Debasis Ganguly (dganguly@computing.dcu.ie), Gareth J.F. Jones (gjones@computing.dcu.ie), ADAPT Centre, School of Computing, Dublin City University, Dublin, Ireland

ABSTRACT
A limitation of standard information retrieval (IR) models is that the notion of term compositionality is restricted to predefined phrases and term proximity. Standard text based IR models provide no easy way of representing semantic relations between terms that are not necessarily phrases, such as the equivalence relationship between ‘osteoporosis’ and the terms ‘bone’ and ‘decay’. To alleviate this limitation, we introduce a relevance feedback (RF) method which makes use of embedded word vectors. We leverage the fact that the vector addition of word embeddings leads to a semantic composition of the corresponding terms, e.g. addition of the vectors for ‘bone’ and ‘decay’ yields a vector that is likely to be close to the vector for the word ‘osteoporosis’. Our proposed RF model enables the incorporation of semantic relations by exploiting term compositionality with embedded word vectors. We develop our model for RF as a generalization of the relevance model (RLM). Our experiments demonstrate that our word embedding based RF model significantly outperforms the RLM model on standard TREC test collections, namely the TREC 6, 7, 8 and Robust ad-hoc and the TREC 9 and 10 WT10G test collections.

Keywords
Word Vector Embedding, Word Compositionality, Relevance Feedback, Kernel Density Estimation

1. INTRODUCTION
Standard information retrieval (IR) models for text search are based on the mutual term independence assumption. Incorporating a representation of term dependencies within the framework of IR is generally expected to improve retrieval effectiveness.
These methods to incorporate term dependencies range from representing terms in a reduced dimensional space by algebraic or probabilistic approaches [5, 13] to making use of generative models for term dependence that are based on word translation models or latent topic models [1, 23]. However, none of these methods provides an easy way of representing semantic relations involving multi-word concepts, such as the semantic equivalence between the term ‘osteoporosis’ and the concept expressed together by the terms ‘bone’ and ‘decay’.

[CIKM’16, October 24-28, 2016, Indianapolis, IN, USA. © 2016 ACM. ISBN 978-1-4503-4073-1/16/10 $15.00. DOI: http://dx.doi.org/10.1145/2983323.2983750]

The recently developed theory of word embeddings [19], where a word, instead of being treated as a categorical variable, is transformed into a vector of real numbers, opens up a new avenue for exploring the benefits of leveraging term compositionality in the context of IR. One of the most powerful features of word embedding is that the addition of word vectors corresponds to a semantic composition of the terms. Thus, addition of the vectors for ‘bone’ and ‘decay’ yields a vector that is in close proximity (within the embedded vector space) to the vector for ‘osteoporosis’.
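The additive compositionality described above can be sketched as follows. The embeddings here are toy vectors constructed purely for illustration (real word2vec vectors would be learned from a corpus), so the specific numbers are assumptions; only the mechanism, vector addition followed by a cosine-similarity comparison, reflects the idea in the text.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-dimensional embeddings, constructed only for illustration;
# real embeddings would come from a trained word2vec model.
emb = {
    "bone":         np.array([0.9, 0.1, 0.0, 0.2]),
    "decay":        np.array([0.1, 0.8, 0.1, 0.3]),
    "osteoporosis": np.array([0.8, 0.7, 0.1, 0.4]),
    "guitar":       np.array([0.0, 0.1, 0.9, 0.0]),
}

# Semantic composition by vector addition.
composed = emb["bone"] + emb["decay"]

sim_osteo  = cosine(composed, emb["osteoporosis"])
sim_guitar = cosine(composed, emb["guitar"])

# The composed vector lies much nearer 'osteoporosis' than an
# unrelated term such as 'guitar'.
assert sim_osteo > sim_guitar
```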
Most existing research exploring the use of word embeddings for IR has involved improving the effectiveness of initial retrieval via improved document representations that incorporate semantic similarities between terms [22, 9]. For instance, the work in [22] represents documents and queries as composed vectors of their constituent words in order to compute the semantic similarity between them. The compositional characteristic of the word vectors has also been used recently to learn the weights of query terms during retrieval [26].

However, there has been very little work which systematically examines the use of word vector embeddings for relevance feedback (RF) and query expansion (QE). This is an interesting direction for study because the additional information about the semantic relations between potential expansion terms (as captured by the distances between the corresponding vector embeddings) may be utilized to further improve retrieval effectiveness. In fact, this constitutes the key idea behind our proposed feedback method. The only existing studies that we are aware of which explore word embeddings for QE are somewhat ad-hoc in nature. For example, [10] and [11] simply use the k nearest neighbours of a query word vector as additional query terms, for the purpose of medical and advertisement search respectively. A major limitation of these approaches is that QE is done prior to the initial retrieval. As a result, these methods have no way of utilizing information which has been shown, in general, to be useful for RF, e.g. the co-occurrence of terms in the query with those in the top ranked documents,
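The pre-retrieval expansion strategy attributed to [10, 11] above amounts to the following sketch: for each query term, add its k nearest neighbours in the embedding space (by cosine similarity) to the query. The toy embeddings and the helper name `knn_expand` are assumptions for illustration, not the cited authors' implementations.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def knn_expand(query_terms, emb, k=2):
    """Pre-retrieval QE sketch: append, for each query term, its k
    nearest neighbours in the embedding space.  Note that this uses
    no information from the top-ranked documents."""
    expanded = list(query_terms)
    for q in query_terms:
        if q not in emb:
            continue
        neighbours = sorted(
            (t for t in emb if t != q and t not in query_terms),
            key=lambda t: cosine(emb[q], emb[t]),
            reverse=True,
        )
        expanded.extend(neighbours[:k])
    return expanded

# Toy 3-dimensional embeddings for illustration only.
emb = {
    "bone":    np.array([0.9, 0.1, 0.2]),
    "decay":   np.array([0.1, 0.8, 0.3]),
    "density": np.array([0.8, 0.2, 0.3]),
    "guitar":  np.array([0.0, 0.1, 0.9]),
}

expanded_query = knn_expand(["bone"], emb, k=1)  # ['bone', 'density']
```

Because the expansion happens before any retrieval, the method cannot exploit feedback signals such as term co-occurrence in the top-ranked documents, which is exactly the gap the proposed RF model addresses.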