P. Perner and A. Rosenfeld (Eds.): MLDM 2003, LNAI 2734, pp. 425–438, 2003.
© Springer-Verlag Berlin Heidelberg 2003
A Machine Learning Model for Information Retrieval
with Structured Documents
Benjamin Piwowarski and Patrick Gallinari
LIP6 – Université Paris 6, 8, rue du capitaine Scott, 75015 Paris, France
{bpiwowar,gallinar}@poleia.lip6.fr
Abstract. Most recent document standards rely on structured representations.
On the other hand, current information retrieval systems have been developed
for flat document representations and cannot be easily extended to cope with
more complex document types. Only a few models have been proposed for
handling structured documents, and the design of such systems is still an open
problem. We present a new model for structured document retrieval that makes it
possible to compute and combine the scores of document parts. It is based on
Bayesian networks and allows the model parameters to be learned in the presence
of incomplete data. We present an application of this model to ad-hoc retrieval
and evaluate its performance on a small structured collection. The model can
also be extended to cope with other tasks such as interactive navigation in
structured documents or corpora.
1 Introduction
With the expansion of the Web and of large textual resources such as electronic
libraries, a need has emerged for new textual representations that allow interoperability
and provide rich document descriptions. Several structured document representations
and formats have been proposed over the last few years, together with description
languages such as XML. For electronic libraries, Web documents, and other
textual resources¹, structured representations are now becoming a standard. This
allows for richer descriptions with the incorporation of metadata, annotations,
multimedia information, etc. Document structure is an important source of evidence, and in
the IR community some authors have argued that it should be considered together
with textual content for information access tasks [1]. This is a natural and intuitive idea
since human understanding of documents relies heavily on their structure. Structured
representations make it possible to capture relations between document parts, as is the case for
books or scientific papers. Information retrieval engines should be able to cope with
the complexity of new document standards so as to fully exploit the potential of these
representations and to provide new functionalities for information access. For example,
users may need to access a specific document part or navigate through complex
documents or structured collections, and queries may address both metadata and textual
content. On the other hand, most current information retrieval systems still rely on
¹ See for example the DocBook standard [18].