Bayesian Networks for Structured Information Retrieval

Benjamin Piwowarski, LIP 6, Paris, France, bpiwowar@poleia.lip6.fr
Huyen-Trang Vu, LIP 6, Paris, France, vu@poleia.lip6.fr
Patrick Gallinari, LIP 6, Paris, France, gallinar@poleia.lip6.fr

ABSTRACT
We present a Bayesian framework for XML document retrieval. This framework allows us to consider content-only queries. We perform the retrieval task using inference in our network. Our model can adapt to a specific corpus through parameter learning and uses a grammar to speed up the retrieval process in large or distributed databases. We also experimented with list filtering to avoid element overlap in the retrieved element list.

Keywords
Bayesian networks, INEX, XML, Focused retrieval, Structured retrieval

1. INTRODUCTION
The goal of our model is to provide a generic system for performing different IR tasks on collections of structured documents. We take an IR approach to this problem. We want to retrieve specific relevant elements from the collection as an answer to a query. The elements may be any document or document part (full document, section(s), paragraph(s), ...) indexed from the structural description of the collection. We treat the task as focused retrieval, first described in [1, 7].

The aim of INEX (Initiative for the Evaluation of XML Retrieval) is to provide means, in the form of a large testbed (test collection) and appropriate scoring methods, for evaluating the retrieval of XML documents. Among the different INEX tasks, we focused on free text queries (CO for Content Only) since many questions still remain open for this specific task. The Bayesian Network (BN) model is briefly described in section 2.1.

2. MODELS
The generic BN model used for the CO task was described in [8]. We only give here the main model characteristics. Our work is an attempt to develop a formal model for structured document access.
Our model relies on Bayesian networks and provides an alternative to other specific approaches for handling structured documents [6, 3, 4]. BNs offer a general framework for taking into account dependency relations between different structural elements. These elements, which we call doxels (for Document Element), are random variables in our BN.

We believe that this approach allows different information access tasks to be cast into a single formalism, and that these models support sophisticated inferences, e.g. they make it possible to compute the relevance of different document parts in the presence of missing or uncertain information. Compared to other approaches based on BNs, we propose a general framework which should adapt to different types of structured documents or collections. Another original aspect of our work is that model parameters are learnt from data. This allows the model to be rapidly adapted to different document collections and IR tasks.

Compared to the model presented in [8], we have made several additions:
• We experimented with different weighting schemes for terms in the different doxels. Term weights may be computed relative to the whole corpus of documents, to doxels labelled with the same tag, ...;
• We introduced a grammar for modelling different constraints on the possible relevance values of doxels on the same path;
• To limit the overlap of retrieved doxels, we introduced simple filtering techniques.

2.1 Bayesian networks
The BN structure we used directly reflects the document hierarchy, i.e. we consider that each structural part within that hierarchy has an associated random variable. The root of the BN is thus a "corpus" variable, its children the "journal collection" variables, etc.
In this model, due to the conditional independence property of the BN variables, relevance is a local property in the following sense: if we know that the journal is (not) relevant, the relevance value of the journal collection will not bring any new information on the relevance of one article of this journal.

In our model, the random variable associated with a structural element can take three different values in the set V = {N, G, E}, which is related to the specificity dimension of the INEX'03 assessment scale: N (for Not relevant) when the element is not relevant; G (for too biG) when the element is marginally or fairly specific; E (for Exact) when the element has a high specificity.
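
The locality of relevance and the three-valued variable can be illustrated on a toy chain "journal collection" → "journal" → "article". The following is only a minimal sketch: the probability tables `PRIOR` and `CPT` are made-up numbers (in the model described here the parameters are learnt from data), the same table is reused at both levels purely for brevity, and the function names are hypothetical. The sketch enumerates the joint distribution and shows that, once the journal value is known, the journal collection value does not change the distribution of the article.

```python
from itertools import product

# Relevance values of a doxel: Not relevant, too biG, Exact.
VALUES = ("N", "G", "E")

# Hypothetical prior over the relevance of the journal collection.
PRIOR = {"N": 0.6, "G": 0.3, "E": 0.1}

# Hypothetical conditional table P(child = column | parent = row).
# Assumption: the same table is used for journal|collection and article|journal.
CPT = {
    "N": {"N": 0.90, "G": 0.08, "E": 0.02},
    "G": {"N": 0.30, "G": 0.50, "E": 0.20},
    "E": {"N": 0.10, "G": 0.30, "E": 0.60},
}


def joint(collection, journal, article):
    """Joint probability of the chain collection -> journal -> article."""
    return PRIOR[collection] * CPT[collection][journal] * CPT[journal][article]


def article_given(journal, collection=None):
    """P(article | journal) or P(article | journal, collection), by enumeration."""
    collections = VALUES if collection is None else (collection,)
    weights = {
        a: sum(joint(c, journal, a) for c in collections) for a in VALUES
    }
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def article_marginal():
    """P(article) with nothing observed, summing out both ancestors."""
    return {
        a: sum(joint(c, j, a) for c, j in product(VALUES, VALUES))
        for a in VALUES
    }


# Usage: the two distributions below agree up to rounding, which is the
# locality property; the marginal shows the effect of unobserved ancestors.
print(article_given("G"))
print(article_given("G", collection="E"))
print(article_marginal())
```

Brute-force enumeration is used only because the example has three variables; it is not meant to reflect how inference is carried out on a full collection.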