Choosing Document Structure Weights

ANDREW TROTMAN
Department of Computer Science, University of Otago, PO Box 56, Dunedin, New Zealand
andrew@cs.otago.ac.nz

Abstract

Existing ranking schemes assume all term occurrences in a given document are of equal influence. Intuitively, terms occurring in some places should have a greater influence than those elsewhere: an occurrence in an abstract may be more important than an occurrence in the body text. Although this observation is not new, the problem of finding good weights for each structure remains. Vector space, probability, and Okapi BM25 ranking are extended to include structure weighting. Weights are then selected for the TREC WSJ collection using a genetic algorithm, and the learned weights are tested on an evaluation set of queries. Structure weighted vector space inner product and structure weighted probabilistic retrieval show an improvement of about 5% in mean average precision over their unstructured counterparts. Structure weighted BM25 shows almost no improvement. Analysis suggests BM25 cannot be improved using structure weighting.

Keywords: Structured Information Retrieval, Genetic Algorithms, Vector Space Model, Probability Model.

1. Introduction

Not all parts of a document are created equal. For academic papers, authors are asked to write a few words that concisely describe their work: the title. They are asked to write a few paragraphs that outline their work: the abstract. They are asked to write a few pages that precisely describe the work: the body text. Finally, they are asked to summarize the work with a conclusion. These are very different and unequal parts of the same document.

Vector space (Salton, Wong, & Yang, 1975) and probabilistic (Robertson & Sparck Jones, 1976) IR systems rank documents without regard to term location. A term found in an abstract is of equal importance to the same term in the body text of the same document. The document structure is ignored even though authors write documents with structure. A document may even originate with explicit structure in a mark-up language such as XML (Bray, Paoli, & Sperberg-McQueen, 1998), structure that is then discarded at indexing time. There is a mismatch: documents have structure, yet the IR system ignores it.

Document structure should be utilized in ranking. Knowledge in an abstract, for example, is denser than elsewhere in the document; that is the purpose of an abstract. This observation can be applied to ranking: terms should receive a weight based on where in the document they occur. In other words, if a term occurs in an abstract it should be weighted as such.

Fuller et al. (1993) first suggested structure weighting. Since then the probability model has been extended to include structure weighting (Wolff, Flörke, & Cremers, 2000), IR query languages have been extended to give the user the option of choosing the weights (Fuhr & Großjohann, 2001), and index structures have been proposed (Schlieder & Meuss, 2002). One remaining problem is the choice of the weights. If users do not specify the weights themselves, the IR system should default to a good set of cross-corpus weights. But how can those weights be selected? In this investigation genetic algorithms (GAs) are used. Each document structure is represented by a gene in a chromosome in a GA learning simulation, and selective pressure is applied to maximise mean average precision.
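As an illustration only, the Python sketch below shows the general shape of such a GA learning simulation: a chromosome holds one weight per document structure, and selective pressure favours chromosomes that yield higher mean average precision. The structure names, the GA operators, and the placeholder fitness function are assumptions made for the sketch; they are not the configuration used in the experiments reported here.

```python
import random

STRUCTURES = ["title", "abstract", "body"]   # illustrative structure set


def mean_average_precision(weights):
    """Placeholder fitness: in the real simulation this would run the
    structure weighted ranking over the training queries with these
    weights and return mean average precision. A dummy value is
    returned here so the sketch runs stand-alone."""
    return random.random()


def make_chromosome():
    # One gene (structure weight) per document structure.
    return [random.random() for _ in STRUCTURES]


def crossover(a, b):
    # Single-point crossover of two weight vectors.
    point = random.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(chromosome, rate=0.1):
    # Occasionally replace a gene with a fresh random weight.
    return [random.random() if random.random() < rate else w for w in chromosome]


def evolve(population_size=20, generations=50):
    population = [make_chromosome() for _ in range(population_size)]
    for _ in range(generations):
        # Selective pressure: keep the chromosomes with the best fitness.
        scored = sorted(population, key=mean_average_precision, reverse=True)
        parents = scored[: population_size // 2]
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(population_size - len(parents))]
        population = parents + children
    return max(population, key=mean_average_precision)


if __name__ == "__main__":
    best = evolve()
    print(dict(zip(STRUCTURES, best)))
```

In practice the fitness evaluation dominates the cost, since each candidate weight vector requires ranking the collection against every training query.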
Experiments are conducted with the TREC (Harman, 1993) Wall Street Journal (WSJ) collection using structure weighted variants of inner product, probabilistic, and Okapi BM25 (Robertson, Walker, Beaulieu, Gatford, & Payne, 1995) ranking. The results demonstrate an improvement of about 5% in mean average precision for the vector space and probabilistic models, while no improvement is shown for BM25.
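To make the idea of a structure weighted ranking function concrete, the following minimal sketch scores a document with a structure weighted inner product. It assumes raw term frequencies and the same illustrative structure set and weights as above, and it omits idf and length normalisation for brevity; the precise formulations evaluated in the experiments differ.

```python
from collections import Counter


def structure_weighted_inner_product(query_terms, document, weights):
    """Score = sum over query terms of the query term frequency times the
    structure weighted document term frequency, where each occurrence is
    scaled by the weight of the structure it appears in."""
    query_tf = Counter(query_terms)
    doc_tf = Counter()
    for structure, terms in document.items():
        w = weights.get(structure, 1.0)
        for term in terms:
            doc_tf[term] += w          # weight the occurrence by its structure
    return sum(qtf * doc_tf[term] for term, qtf in query_tf.items())


# Example: an occurrence in the abstract counts for more than one in the body.
doc = {"title": ["ranking"],
       "abstract": ["structure", "weighting"],
       "body": ["ranking", "structure"]}
weights = {"title": 2.0, "abstract": 1.5, "body": 1.0}
print(structure_weighted_inner_product(["structure", "ranking"], doc, weights))
```

Setting every weight to 1.0 recovers the ordinary, structure-blind inner product, which is the unstructured baseline against which the learned weights are compared.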