Efficient Fragmentation of Large XML Documents

Angela Bonifati 1 and Alfredo Cuzzocrea 2

1 ICAR Inst., National Research Council, Italy
bonifati@icar.cnr.it
2 DEIS Dept., University of Calabria, Italy
cuzzocrea@si.deis.unical.it

Abstract. Fragmentation techniques for XML data are gaining momentum within both distributed and centralized XML query engines and pose novel and unrecognized challenges to the community. Albeit not novel, and clearly inspired by the classical divide et impera principle, fragmentation of XML trees has proved successful in boosting query performance and in cutting down memory requirements. However, the fragmentation considered so far has been driven by semantics, i.e. built around query predicates. In this paper, we propose a novel fragmentation technique that builds on structural constraints of XML documents (size, tree-width, and tree-depth) and on special-purpose structure histograms able to meaningfully summarize XML documents. This allows us to predict bounding intervals for the structural properties of the output (XML) fragments, for efficient query processing of distributed XML data. An experimental evaluation of our study confirms the effectiveness of our fragmentation methodology on some representative XML data sets.

1 Introduction

An imminent development in XML processing is undoubtedly making it as fast and efficient as possible. Query engines for XML are being designed and implemented with the specific goal of employing indexes to improve their performance [10]. Others [23] employ statistics to cost the most frequently asked queries, or use classical algebraic techniques [16] to optimize query plans.

On the other hand, XML query processors suffer from main-memory limitations that prevent them from processing large XML documents.
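To make the three structural properties named in the abstract concrete, the following minimal sketch measures them on a small in-memory document. It is an illustration only, not the authors' algorithm: the function name `structural_properties` and the sample document are hypothetical, size is taken to be the number of element nodes, depth the number of levels (root at level 1), and width is assumed here to mean the largest number of nodes on any single level. The paper's formal definitions may differ.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def structural_properties(root):
    """Return (size, width, depth) of the element tree rooted at `root`.

    Assumed definitions (not taken from the paper):
      size  = number of element nodes
      width = largest number of nodes found on any one level
      depth = number of levels, with the root on level 1
    """
    nodes_per_level = Counter()
    # Explicit stack instead of recursion, so deep documents do not
    # hit Python's recursion limit.
    stack = [(root, 1)]
    while stack:
        node, level = stack.pop()
        nodes_per_level[level] += 1
        # Iterating over an Element yields its direct children.
        stack.extend((child, level + 1) for child in node)
    size = sum(nodes_per_level.values())
    width = max(nodes_per_level.values())
    depth = max(nodes_per_level)
    return size, width, depth


# Hypothetical sample document, used only for illustration.
doc = ET.fromstring(
    "<site>"
    "<people><person/><person/><person/></people>"
    "<regions><asia/></regions>"
    "</site>"
)
size, width, depth = structural_properties(doc)
print(size, width, depth)
```

Note that `ET.fromstring` loads the whole document into memory, so this sketch only shows what is being measured; it does not address the main-memory limitation discussed above.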
While content-based predicates can be used to project down parts of documents, an XML query engine that is parsimonious in resources may still enable a further resizing of the obtained projection/query results. This may also happen in many resource-critical contexts, such as a distributed database or a stream processor. The advantages of XML fragmentation have already been demonstrated in an XML query engine [4,5] and in a distributed setting [3]. Fragmentation of XML documents as proposed by these previous works has been based on semantics, whereas in this paper we work out a novel kind of fragmentation, which is orthogonal to the former and is guided only by the structural properties of an XML document.

Given an XML document, modeled w.l.o.g. as a tree, there exist several ways of splitting it into subtrees, which may be semantically driven or structurally driven. Usually, query processors decide to apply projections and selections beforehand in order

R. Wagner, N. Revell, and G. Pernul (Eds.): DEXA 2007, LNCS 4653, pp. 539–550, 2007. © Springer-Verlag Berlin Heidelberg 2007