Information retrieval in structured domains
Vincent W. L. Tam and John Shepherd
School of Computer Science and Engineering
University of New South Wales,
UNSW Sydney, NSW 2052, Australia
vincetam@cse.unsw.edu.au and jas@cse.unsw.edu.au
Abstract
In this work, we investigate utilizing the structure of a
website to increase the effectiveness of document
retrieval within a structured domain. In particular we
examine various methods to combine evidence within the
website in order to improve the quality of pages returned.
Keywords: Information retrieval, Structured IR, Passage
retrieval
1 Introduction
Information retrieval is a broadly studied topic, and
significant research effort has focused on document
retrieval from the World Wide Web. We aim to refine
document retrieval within a website by improving the
estimate of each document's relevance to a query. We
achieve this by taking into consideration the evidence
collected from pages that are related to the document
under inspection.
Websites are normally organised according to some
structure (based on an information architecture) to make
it more convenient for users to navigate the site. Often,
the URL structure of pages reflects this organisation.
These observations raise the issue of whether we can
make use of the structure/organisation to improve search.
The work in this paper sets out to explore this issue by
trying to answer the following questions:
1. Do the pages surrounding an article carry useful
information that improves the quality of results
when ranking documents against queries?
2. How should the set of related pages be defined for
the above purpose, and how should the range of
this set be defined?
Our approach to answering these questions was to
conduct information retrieval experiments on websites
that were known to conform to a well-defined,
hierarchical structure. The goal of these experiments was
to determine how to use the information in related pages
to improve relevance scoring. Such experiments, of
course, could prove only that the approach is effective for
sites that follow this structure.
Copyright © 2009, Australian Computer Society, Inc. This
paper appeared at the 20th Australasian Database Conference
(ADC 2009), Wellington, New Zealand. Conferences in
Research and Practice in Information Technology (CRPIT),
Vol. 92. A. Bouguettaya, X. Lin, Eds. Reproduction for
academic, not-for profit purposes permitted provided this text is
included.
2 Related Work
In this section we review several streams of research that
motivated our experiment.
2.1 URL Structure
URLs of web pages have already been used to improve
retrieval results. Keywords in URLs usually provide hints
for information retrieval, and this has been utilized in
search engines (and by “search engine optimisers”) to
enhance rankings of retrieved pages. URLs have also
been used as evidence to categorize pages in websites
(e.g. Kules, Kustanowitz and Shneiderman 2006; Shih
and Karger 2004) where the pages are organised into
hierarchies of subjects within the website.
Under this assumption, URLs of web pages provide
information on how the pages are categorized. Clearly,
not all websites follow such conventions (e.g. many
dynamically generated websites, whose number is
increasing, do not). However, enough websites are
organised by URL to make this approach worth
considering.
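As an illustration of how a URL can expose a page's place in a site hierarchy, the following minimal Python sketch derives a category path from each URL and groups pages that share one. It is not the method of any of the cited works. The function names and the example.org URLs are hypothetical, and the sketch assumes that directory segments of the URL path correspond to subject categories.

```python
from collections import defaultdict
from urllib.parse import urlparse


def url_category_path(url):
    """Return the directory segments of a URL path as a tuple.

    Assumption: each directory segment names one level of the
    site's subject hierarchy.
    """
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    # If the path does not end with "/", treat the last segment as
    # the page's own file name rather than as a category.
    if segments and not path.endswith("/"):
        segments = segments[:-1]
    return tuple(segments)


def group_by_category(urls):
    """Group URLs that share the same category path."""
    groups = defaultdict(list)
    for url in urls:
        groups[url_category_path(url)].append(url)
    return dict(groups)


# Hypothetical URLs, used only for illustration.
example_urls = [
    "http://example.org/courses/databases/intro.html",
    "http://example.org/courses/databases/sql.html",
    "http://example.org/research/ir/",
]
example_groups = group_by_category(example_urls)
```

Under this assumption, the two pages under courses/databases fall into the same group and can be treated as candidates for related pages, while the page under research/ir forms a group of its own.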
2.2 XML Element retrieval
A major stream of research that is related to our work is
information retrieval in structured documents. This
research focuses mainly on text retrieval from XML
documents. XML documents are well-structured articles
with tags to define elements within the articles.
Information retrieval from XML documents aims to
retrieve elements that closely match the queries. This
stream of research is inspired by the Initiative for the
Evaluation of XML Retrieval (INEX). Our work differs
from XML retrieval in that our targeted documents are
individual pages within a website instead of elements
contained in articles. Elements contained in articles have
a well-defined unit (the article) from which to draw
information about their context (e.g. Kimelfeld, Kovacs,
Sagiv and Yahv 2007). For web pages, on the other hand,
there is no clear boundary for this part-whole
relationship: the range and number of documents to be
included as related pages are not well-defined. Identifying
such boundaries was part of our research objective. A
second difference between our work and XML retrieval is
that queries in XML retrieval can specify the context of
the desired results via XPath, provided that the schema
of the XML documents is known beforehand (e.g.
Beigbeder 2007; Carpineto, Romano and Caracciolo
2007). In addition, the element tags of XML documents
carry additional information for retrieval in the form of
element attributes and element names. This helps in