P. Perner and A. Rosenfeld (Eds.): MLDM 2003, LNAI 2734, pp. 425–438, 2003. © Springer-Verlag Berlin Heidelberg 2003

A Machine Learning Model for Information Retrieval with Structured Documents

Benjamin Piwowarski and Patrick Gallinari

LIP6 – Université Paris 6, 8, rue du capitaine Scott, 75015 Paris, France
{bpiwowar,gallinar}@poleia.lip6.fr

Abstract. Most recent document standards rely on structured representations. Current information retrieval systems, on the other hand, have been developed for flat document representations and cannot easily be extended to cope with more complex document types. Only a few models have been proposed for handling structured documents, and the design of such systems is still an open problem. We present here a new model for structured document retrieval that computes and combines the scores of document parts. It is based on Bayesian networks and allows the model parameters to be learned in the presence of incomplete data. We present an application of this model to ad-hoc retrieval and evaluate its performance on a small structured collection. The model can also be extended to cope with other tasks such as interactive navigation in structured documents or corpora.

1 Introduction

With the expansion of the Web and of large textual resources such as electronic libraries, the need has appeared for new textual representations that allow interoperability and provide rich document descriptions. Several structured document representations and formats have been proposed over the last few years, together with description languages such as XML. For electronic libraries, Web documents, and other textual resources¹, structured representations are now becoming a standard. This allows for richer descriptions that incorporate metadata, annotations, multimedia information, etc.
¹ See, for example, the DocBook standard [18].

Document structure is an important source of evidence, and in the IR community some authors have argued that it should be considered together with textual content for information access tasks [1]. This is a natural and intuitive idea, since human understanding of documents relies heavily on their structure. Structured representations capture relations between document parts, as is the case for books or scientific papers. Information retrieval engines should be able to cope with the complexity of new document standards so as to fully exploit the potential of these representations and to provide new functionalities for information access. For example, users may need to access a specific document part or navigate through complex documents or structured collections, and queries may address both metadata and textual content. On the other hand, most current information retrieval systems still rely on