Bayesian Networks for Structured Information Retrieval

Benjamin Piwowarski
LIP 6, Paris, France
bpiwowar@poleia.lip6.fr

Huyen-Trang Vu
LIP 6, Paris, France
vu@poleia.lip6.fr

Patrick Gallinari
LIP 6, Paris, France
gallinar@poleia.lip6.fr

ABSTRACT

We present a Bayesian framework for XML document retrieval. This framework allows us to consider content-only queries. We perform the retrieval task using inference in our network. Our model can adapt to a specific corpus through parameter learning and uses a grammar to speed up the retrieval process in big or distributed databases. We also experimented with list filtering to avoid element overlap in the retrieved element list.

Keywords

Bayesian networks, INEX, XML, Focused retrieval, Structured retrieval

1. INTRODUCTION

The goal of our model is to provide a generic system for performing different IR tasks on collections of structured documents. We take an IR approach to this problem: we want to retrieve specific relevant elements from the collection as an answer to a query. The elements may be any document or document part (full document, section(s), paragraph(s), ...) indexed from the structural description of the collection. We treat the task as focused retrieval, first described in [1, 7].

The aim of INEX (Initiative for the Evaluation of XML retrieval) is to provide means, in the form of a large testbed (test collection) and appropriate scoring methods, for the evaluation of retrieval of XML documents. Among the different INEX tasks, we focused on free text queries (CO for Content Only) since many questions still remain open for this specific task. The Bayesian Network (BN) model is briefly described in section 2.1.

2. MODELS

The generic BN model used for the CO task was described in [8]. We only give the main model characteristics here. Our work is an attempt to develop a formal model for structured document access.
Our model relies on Bayesian networks and provides an alternative to other specific approaches for handling structured documents [6, 3, 4]. BNs offer a general framework for taking into account dependencies between different structural elements. Those elements, which we call doxels (for Document Element), will be random variables in our BN.

We believe that this approach allows casting different information access tasks into a unique formalism, and that these models allow performing sophisticated inferences, e.g. they make it possible to compute the relevance of different document parts in the presence of missing or uncertain information. Compared to other approaches based on BNs, we propose a general framework which should adapt to different types of structured documents or collections. Another original aspect of our work is that model parameters are learnt from data. This makes it possible to rapidly adapt the model to different document collections and IR tasks.

Compared to the model presented in [8], we have made several additions:

• We experimented with different weighting schemes for terms in the different doxels. Weight importance may be relative to the whole corpus of documents, to doxels labelled with the same tag, ...;

• We introduced a grammar for modelling different constraints on the possible relevance values of doxels on the same path;

• To limit the overlap of retrieved doxels, we introduced simple filtering techniques.

2.1 Bayesian networks

The BN structure we used directly reflects the document hierarchy, i.e. we consider that each structural part within that hierarchy has an associated random variable. The root of the BN is thus a "corpus" variable, its children the "journal collection" variables, etc.
In this model, due to the conditional independence property of the BN variables, relevance is a local property in the following sense: if we know that the journal is (not) relevant, the relevance value of the journal collection will not bring any new information on the relevance of one article of this journal.

In our model, the random variable associated with a structural element can take three different values in the set V = {N, G, E}, which is related to the specificity dimension of the INEX'03 assessment scale: N (for Not relevant) when the element is not relevant; G (for too biG) when the element is marginally or fairly specific; E (for Exact) when the element has a high specificity.
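
The locality property above can be checked numerically on a toy chain "corpus" → "journal collection" → "journal" → "article". The following is only a minimal sketch under stated assumptions: the prior, the conditional table, and the function names `joint` and `conditional` are hypothetical and chosen for illustration, a single table is reused on every edge, and the query and the learnt parameters of the actual model are left out.

```python
from itertools import product

# The three relevance values: Not relevant, too biG, Exact.
VALUES = ("N", "G", "E")

# One path of the document hierarchy, from the root down to an article.
CHAIN = ("corpus", "journal collection", "journal", "article")

# Hypothetical prior for the root variable (illustrative numbers only).
PRIOR = {"N": 0.2, "G": 0.7, "E": 0.1}

# Hypothetical table P(child = c | parent = p), stored as CPT[p][c] and
# reused on every edge of the chain (illustrative numbers only).
CPT = {
    "N": {"N": 0.90, "G": 0.05, "E": 0.05},
    "G": {"N": 0.50, "G": 0.30, "E": 0.20},
    "E": {"N": 0.10, "G": 0.30, "E": 0.60},
}


def joint(assignment):
    """Joint probability of one full assignment of values along CHAIN."""
    p = PRIOR[assignment[0]]
    for parent_value, child_value in zip(assignment, assignment[1:]):
        p *= CPT[parent_value][child_value]
    return p


def conditional(target, given):
    """P(CHAIN[i] = value | given) by brute-force enumeration.

    target is a pair (i, value); given maps chain positions to values.
    """
    index, value = target
    numerator = denominator = 0.0
    for assignment in product(VALUES, repeat=len(CHAIN)):
        if any(assignment[i] != v for i, v in given.items()):
            continue  # assignment contradicts the evidence
        p = joint(assignment)
        denominator += p
        if assignment[index] == value:
            numerator += p
    return numerator / denominator


if __name__ == "__main__":
    # Once the journal is known, the journal collection adds nothing.
    for collection_value in VALUES:
        p = conditional((3, "E"), {1: collection_value, 2: "G"})
        print(f"P(article=E | journal=G, collection={collection_value}) = {p:.4f}")
```

Running the script prints the same probability for all three values of the journal collection, which is the sense in which relevance is local: given the value of the journal, the article is independent of everything above the journal.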