Query-Based Inter-document Similarity Using Probabilistic Co-relevance Model

Seung-Hoon Na¹, In-Su Kang², and Jong-Hyeok Lee¹

¹ POSTECH, Pohang, South Korea
{nsh1979,jhlee}@postech.ac.kr
² KISTI, Daejeon, South Korea
dbaisk@kisti.re.kr

Abstract. Inter-document similarity is the critical information that determines whether cluster-based retrieval improves on the baseline. However, no theoretical work on inter-document similarity has been carried out, even though such work could provide a principle for defining an improved similarity in a well-motivated direction. To develop such a theory, this paper starts by pursuing an ideal inter-document similarity that optimally satisfies the cluster hypothesis. We propose a probabilistic principle of inter-document similarity: the optimal similarity of two documents should be proportional to the probability that they are co-relevant to an arbitrary query. Based on this principle, the study of inter-document similarity is formulated as the problem of estimating the co-relevance model of documents. Furthermore, we find that the optimal inter-document similarity should be defined using queries, not terms, as its basic unit, namely a query-based similarity. We rigorously derive a novel query-based similarity from the co-relevance model, without any heuristics. Experimental results show that the new query-based inter-document similarity significantly improves on the previously used term-based similarity in the context of Voorhees' evaluation measure.

1 Introduction

The cluster hypothesis is a widely accepted concept in the information retrieval community, guiding the study of cluster-based retrieval [1]. Building on it, researchers have investigated cluster-based retrieval, implicitly assuming that the inter-document similarity they use satisfies the cluster hypothesis well.
Basically, since a retrieval model itself can be used directly to calculate an inter-document similarity, researchers have adopted the term-based inter-document similarity that the retrieval model defines without serious scrutiny; that is, one of the two documents is treated as a query [2,3,4,5].

However, an inter-document similarity that is adequate for cluster-based retrieval may not be a term-based similarity. Without a theory of inter-document similarity, we cannot decide whether the appropriate inter-document similarity is the term-based similarity that a retrieval model defines. This is one of the reasons why a theory of inter-document similarity is required. Unfortunately, previous work has not investigated it.

C. Macdonald et al. (Eds.): ECIR 2008, LNCS 4956, pp. 684–688, 2008. © Springer-Verlag Berlin Heidelberg 2008