A Cluster-Based Resampling Method for Pseudo-Relevance Feedback

Kyung Soon Lee
Department of Computer Engineering, Chonbuk National University, Republic of Korea
selfsolee@chonbuk.ac.kr

W. Bruce Croft, James Allan
Center for Intelligent Information Retrieval, Department of Computer Science, University of Massachusetts Amherst, USA
croft@cs.umass.edu, allan@cs.umass.edu

ABSTRACT
Typical pseudo-relevance feedback methods assume the top-retrieved documents are relevant and use these pseudo-relevant documents to expand terms. The initial retrieval set can, however, contain a great deal of noise. In this paper, we present a cluster-based resampling method to select better pseudo-relevant documents based on the relevance model. The main idea is to use document clusters to find dominant documents for the initial retrieval set, and to repeatedly feed these documents back to emphasize the core topics of a query. Experimental results on large-scale web TREC collections show significant improvements over the relevance model. To justify the resampling approach, we examine the relevance density of feedback documents. A higher relevance density will result in greater retrieval accuracy, ultimately approaching true relevance feedback. The resampling approach shows higher relevance density than the baseline relevance model on all collections, resulting in better retrieval accuracy in pseudo-relevance feedback. This result indicates that the proposed method is effective for pseudo-relevance feedback.

Categories and Subject Descriptors
H.3.3 [Information Storage & Retrieval]: Relevance Feedback

General Terms
Algorithms, Experimentation

Keywords
Information retrieval, pseudo-relevance feedback, cluster-based resampling, dominant documents, query expansion

1.
INTRODUCTION

Most pseudo-relevance feedback methods (e.g., [12,19,7]) assume that a set of top-retrieved documents is relevant and then learn from these pseudo-relevant documents to expand terms or to assign better weights to the original query. This is similar to the process used in relevance feedback, where actual relevant documents are used [23]. In general, however, the top-retrieved documents contain noise: when the precision of the top 10 documents (P@10) is 0.5, five of the ten are non-relevant. This is common and even expected in all retrieval models. Such noise can cause the query representation to "drift" away from the original query.

This paper describes a resampling method that uses clusters to select better documents for pseudo-relevance feedback. Document clusters for the initial retrieval set can represent aspects of a query, especially on large-scale web collections, since the initial retrieval results for such collections may involve diverse subtopics. Since it is difficult to find one optimal cluster, we use several relevant groups for feedback. By permitting overlapped clusters for the top-retrieved documents and repeatedly feeding dominant documents that appear in multiple highly-ranked clusters, we expect the expansion query to emphasize the core topics of a query.

This is not the first time that clustering has been suggested as an improvement for relevance feedback. In fact, clustering was mentioned in some of the first work related to pseudo-relevance feedback [1]. Previous attempts to use clusters have not improved effectiveness. The work presented here is based on a new approach to using the clusters that produces significantly better results.

Our motivation for using clusters and resampling is as follows: the top-retrieved documents are a query-oriented ordering that does not consider the relationship between documents.
We view the pseudo-relevance feedback problem of learning expansion terms closely related to a query as similar to the classification problem of learning an accurate decision boundary from training examples. We approach this problem by repeatedly selecting dominant documents, so that term expansion moves toward the dominant documents of the initial retrieval set, much as boosting a weak learner repeatedly selects hard examples to move the decision boundary toward those examples.

The hypothesis behind using overlapped document clusters is that a good representative document for a query may have several nearest neighbors with high similarities and therefore participate in several different clusters. Since it plays a central role in forming clusters, such a document may be dominant for the topic. Repeatedly sampling dominant documents, rather than randomly resampling documents for feedback, can emphasize the topics of a query.

We show that resampling feedback documents based on clusters contributes to higher relevance density for feedback documents on a variety of TREC collections. The results on large-scale web collections such as the TREC WT10g and GOV2 collections show significant improvements over the baseline relevance model.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Conference’04, Month 1–2, 2004, City, State, Country. Copyright 2004 ACM 1-58113-000-0/00/0004…$5.00.
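The notion of a dominant document in the introduction can be illustrated with a small sketch. It is not the method as specified in the paper: it assumes toy two-dimensional document vectors, cosine similarity, and one k-nearest-neighbor cluster per top-retrieved document, and it omits the cluster ranking step entirely. The helper names `cosine`, `knn_clusters`, and `membership_counts` are hypothetical. The sketch only shows how overlapping clusters let some documents appear several times in the feedback sample.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_clusters(vectors, k):
    """Build one overlapping cluster per document:
    the document itself plus its k most similar neighbors."""
    clusters = []
    for i, u in enumerate(vectors):
        neighbors = sorted(
            (j for j in range(len(vectors)) if j != i),
            key=lambda j: cosine(u, vectors[j]),
            reverse=True,
        )[:k]
        clusters.append({i, *neighbors})
    return clusters


def membership_counts(clusters):
    """Count how many clusters each document participates in."""
    counts = Counter()
    for cluster in clusters:
        counts.update(cluster)
    return counts


# Toy "top-retrieved" set (an assumption for illustration only):
# documents 0-2 are close to each other, document 3 is off-topic.
docs = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.8, 0.3],
    [0.0, 1.0],
]

clusters = knn_clusters(docs, k=1)
counts = membership_counts(clusters)

# Feeding every cluster's members back yields a sample in which
# documents that sit in several clusters occur repeatedly.
feedback_sample = [d for cluster in clusters for d in sorted(cluster)]
```

In this toy set, document 1 is the nearest neighbor of two other documents, so it belongs to three clusters and occurs three times in `feedback_sample`, while the off-topic document 3 occurs once. A term-expansion step run over such a sample would weight the central document more heavily, which is the intuition behind repeatedly feeding dominant documents.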