Experiments on Element and Document Statistics for XML Retrieval

Mohamed Ben Aouicha, Mohamed Tmar, Mohand Boughanem, and Mohamed Abid

Abstract—This paper presents an information retrieval model for XML documents based on tree matching. Queries and documents are represented by extended trees. An extended tree is built from the original tree by adding weighted virtual links between each node and its indirect descendants, so that every descendant can be reached directly. Consequently, only one level separates each node from its indirect descendants. This makes it possible to compare the user query and the document flexibly while respecting the structural constraints of the query. The content of each node is very important in deciding whether a document element is relevant, so content must be taken into account in the retrieval process. We separate the structure-based and the content-based retrieval processes. The content-based score of each node is commonly based on the well-known Tf × Idf criterion. In this paper, we compare this criterion with another one that we call Tf × Ief. The comparison is based on experiments on a dataset provided by INEX¹, which show the effectiveness of our approach on the one hand and that of both weighting functions on the other.

Keywords—XML retrieval, INEX, Tf × Idf, Tf × Ief

I. INTRODUCTION

Extensible Markup Language (XML) [1] is becoming widely used as a standard document format in many application domains. Over the past few years, a great volume of static and dynamic data has been produced in XML. Therefore, XML retrieval is becoming more and more essential [4]. XML documents account for a large share of content not only on the Web but also in modern digital libraries, in business-to-business and business-to-consumer software, and above all in Web-services-oriented software. This is due to the great importance of structured information.
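The extended tree described in the abstract can be illustrated with a small sketch. It enumerates every ancestor/descendant pair of a toy document tree and attaches a weight to each link. The `Node` class, the `extended_links` function, and the `decay` parameter are hypothetical names introduced here, and the rule that a link spanning k levels weighs `decay ** (k - 1)` is only an assumption for illustration; the weighting actually used by the model is not given in this section.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)


def extended_links(root: Node, decay: float = 0.5) -> List[Tuple[str, str, float]]:
    """List (ancestor, descendant, weight) links of the extended tree.

    Direct parent/child links get weight 1.0; a virtual link spanning
    k levels gets decay ** (k - 1). The decay rule is an assumption
    made for this sketch, not the weighting used in the paper.
    """
    links: List[Tuple[str, str, float]] = []

    def visit(node: Node, ancestors: List[Node]) -> None:
        # Walk from the closest ancestor (gap 1) up to the root.
        for gap, ancestor in enumerate(reversed(ancestors), start=1):
            links.append((ancestor.tag, node.tag, decay ** (gap - 1)))
        for child in node.children:
            visit(child, ancestors + [node])

    visit(root, [])
    return links


# article -> sec -> p: the pair (article, p) gets a virtual link.
doc = Node("article", [Node("sec", [Node("p")])])
for link in extended_links(doc):
    print(link)
```

With this toy tree, the two original parent/child links keep weight 1.0 and the indirect pair (article, p) receives a virtual link of weight 0.5, so a query asking for a `p` directly under `article` can still match, at a reduced structural score.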
M. B. Aouicha is with the Institut de Recherche en Informatique de Toulouse, 118 Route de Narbonne, 31062, email: mohamed.benaouicha@irit.fr
M. Tmar is with the Institut Supérieur d'Informatique et du Multimédia de Sfax, Route de Tunis, B.P.: 1030, 3018, email: mohamedtmar@isimf.rnu.tn
M. Boughanem is with the Institut de Recherche en Informatique de Toulouse, 118 Route de Narbonne, 31062, email: boughane@irit.fr
M. Abid is with the Ecole Nationale d'Ingénieurs de Sfax, Route de Soukra, 3038, email: mohamed.abid@enis.rnu.tn
¹ INitiative for the Evaluation of XML retrieval, an evaluation forum that aims at promoting retrieval capabilities on XML documents.

While both text and structure are important, we usually give higher priority to text when ranking XML elements. We adapt unstructured retrieval methods to handle additional structural constraints. Such approaches are called text-centric XML retrieval. The vector-space-based XML retrieval method proposed in [20] defines each dimension by a sub-tree that contains at least one indexing term. Queries and documents are then represented by vectors in this space, and the system computes matches between them using well-known similarity measures (Cosine, Dice, Overlap, ...). Schlieder and Meuss [16] describe similar approaches. They proposed the ApproXQL model, which integrates the document structure into the similarity measure of the vector space model. The query model is based on tree matching: it rewrites the queries and the documents independently and then performs XML retrieval based on the basics of the vector space model.

Several teams have used a language modeling approach to XML retrieval. Ogilvie and Callan [21] use a tree-based generative language model for ranking documents and components. They build a language model for nodes and another for leaf nodes, depending on their components. Inner nodes are estimated using a linear interpolation among the children nodes. The probabilistic model has been applied to XML documents by [19] and [11].
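The similarity measures named above (Cosine, Dice, Overlap) can be sketched for sparse weighted vectors as follows. This is a minimal illustration of the standard vector-space formulas, not the implementation of [20] or [16]; the dimension keys such as `"sec/xml"` are hypothetical stand-ins for "a sub-tree containing an indexing term", and the unit weights are made up for the example.

```python
import math
from typing import Dict

Vector = Dict[str, float]  # dimension label -> weight (labels are illustrative)


def dot(a: Vector, b: Vector) -> float:
    """Inner product over the dimensions shared by both vectors."""
    return sum(w * b[k] for k, w in a.items() if k in b)


def sq_norm(a: Vector) -> float:
    """Squared Euclidean norm of a sparse vector."""
    return sum(w * w for w in a.values())


def cosine(a: Vector, b: Vector) -> float:
    denom = math.sqrt(sq_norm(a) * sq_norm(b))
    return dot(a, b) / denom if denom else 0.0


def dice(a: Vector, b: Vector) -> float:
    denom = sq_norm(a) + sq_norm(b)
    return 2 * dot(a, b) / denom if denom else 0.0


def overlap(a: Vector, b: Vector) -> float:
    denom = min(sq_norm(a), sq_norm(b))
    return dot(a, b) / denom if denom else 0.0


# Hypothetical query and element vectors with unit weights.
query = {"sec/xml": 1.0, "sec/retrieval": 1.0}
element = {"sec/xml": 1.0, "sec/retrieval": 1.0, "sec/tree": 1.0}

for name, measure in (("cosine", cosine), ("dice", dice), ("overlap", overlap)):
    print(name, round(measure(query, element), 4))
```

All three measures share the same numerator (the inner product) and differ only in how they normalize it, which is why Overlap rewards an element that fully contains the query while Cosine and Dice penalize the extra dimension.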
In contrast to text-centric XML, data-centric XML mainly encodes non-text data. When querying data-centric XML, the user imposes exact-match conditions in most cases. This approach is commonly used for data collections with complex structures and non-text data. There are powerful query languages for XML that can handle structure efficiently. The best known of these languages is XQuery [22]. However, it is challenging to implement an XQuery-based system that provides ranked lists of elements. Amer-Yahia [23] uses a pattern matching model based on XQuery to handle XML retrieval. Fuhr [5] uses the query language XIRQL, which combines the structure-based and the content-based approaches. It integrates data-centric features by using ideas from logic-based probabilistic IR models, in combination with concepts from the database area.

In this paper, we separate content from structure as early as the indexing of queries and documents [12]. We handle the document structure by retrieving candidate document fragments that approximately follow the query structure, and then complete their scores by content retrieval. Structure-based scores are assigned to document fragments, with the highest scores going to fragments that have exactly the same structure as the query. Besides, we assume that content retrieval is the main decisive criterion of relevance. Although XML retrieval is based on structure constraints, the content is still the

World Academy of Science, Engineering and Technology, International Journal of Computer and Information Engineering, Vol:2, No:2, 2008. waset.org/Publication/8258