A Topic-based Measure of Resource Description Quality for Distributed Information Retrieval

Mark Baillie 1, Mark J. Carman 2, and Fabio Crestani 2

1 CIS Dept., University of Strathclyde, Glasgow, UK
mb@cis.strath.ac.uk
2 Faculty of Informatics, University of Lugano, Lugano, Switzerland
{mark.carman, fabio.crestani}@lu.unisi.ch

Abstract. The aim of query-based sampling is to obtain a sufficient, representative sample of an underlying (text) collection. Current measures for assessing sample quality are too coarse-grained to be informative. This paper outlines a measure of finer granularity based on probabilistic topic models of text. The assumption we make is that a representative sample should capture the broad themes of the underlying text collection. If these themes are not captured, then resource selection will be affected in terms of performance, coverage and reliability. For example, resource selection algorithms that require extrapolation from a small sample of indexed documents to determine which collections are most likely to hold relevant documents may be affected by samples which do not reflect the topical density of a collection. To address this issue we propose to measure the relative entropy between topics obtained in a sample with respect to the complete collection. Topics are both modelled from the collection and inferred in the sample using latent Dirichlet allocation. The paper outlines an analysis and evaluation of this methodology across a number of collections and sampling algorithms.

1 Introduction

Distributed information retrieval (DIR) [1], also known as federated search or selective meta-searching [8], links multiple search engines into a single, virtual information retrieval system. DIR encompasses a body of research investigating alternative solutions for searching online content that cannot be readily accessed through standard means such as content crawling or harvesting.
This content is often referred to as the deep or hidden web, and includes information that cannot be accessed by crawling the hyperlink structure. A DIR system integrates multiple searchable online resources 3 into a single search service. For cooperative collections, content statistics can be accessed through an agreed protocol.

3 By resource we mean any online information repository that is searchable. This could be a free-text or Boolean search system, a relational database, etc. We assume only that a site has a discoverable search text box.

When cooperation from an information resource provider cannot be guaranteed, it is necessary to obtain an unbiased and accurate description of