XML data partitioning schemes for parallel holistic twig joins

Imam Machdi, Toshiyuki Amagasa and Hiroyuki Kitagawa
Graduate School of Systems and Information Engineering, University of Tsukuba, Tsukuba, Japan

Abstract

Purpose – The purpose of this paper is to propose Extensible Markup Language (XML) data partitioning schemes that can cope with static and dynamic allocation for parallel holistic twig joins: grid metadata model for XML (GMX) and streams-based partitioning method for XML (SPX).

Design/methodology/approach – GMX exploits the relationships between XML documents and query patterns to perform workload-aware partitioning of XML data. Specifically, the paper constructs a two-dimensional model with a document dimension and a query dimension in which each object in a dimension is composed from XML metadata related to the dimension. GMX provides a set of XML data partitioning methods that include document clustering, query clustering, document-based refinement, query-based refinement, and query-path refinement, thereby enabling XML data partitioning based on the static information of XML metadata. In contrast, SPX explores the structural relationships of query elements and a range-containment property of XML streams to generate partitions and allocate them to cluster nodes on-the-fly.

Findings – GMX provides several salient features: a set of partition granularities that balance workloads of query processing costs among cluster nodes statically; inter-query parallelism as well as intra-query parallelism at multiple extents; and better parallel query performance when all estimated queries are executed simultaneously to meet their probability of query occurrences in the system. SPX also offers the following features: minimal computation time to generate partitions; balancing skewed workloads dynamically on the system; producing higher intra-query parallelism; and gaining better parallel query performance.
Research limitations/implications – The current status of the proposed XML data partitioning schemes does not take into account XML data updates, e.g. new XML documents and query pattern changes submitted by users on the system.

Practical implications – Note that effectiveness of the XML data partitioning schemes mainly relies on the accuracy of the cost model to estimate query processing costs. The cost model must be adjusted to reflect characteristics of a system platform used in the implementation.

Originality/value – This paper proposes novel schemes of conducting XML data partitioning to achieve both static and dynamic workload balance.

Keywords Extensible Markup Language, Algorithmic languages, Data structures, Worldwide web

Paper type Research paper

International Journal of Web Information Systems, Vol. 5 No. 2, 2009, pp. 151-194. © Emerald Group Publishing Limited, 1744-0084. DOI 10.1108/17440080910968445. Received 21 July 2008; revised 28 February 2009; accepted 27 March 2009.

The current issue and full text archive of this journal is available at www.emeraldinsight.com/1744-0084.htm

This study has been partially supported by Grant-in-Aid for Young Scientists (B) (19700083), MEXT (19024006), and CREST of Japan Science and Technology Agency.

1. Introduction

With the growing popularity of the World Wide Web, many applications have adopted Extensible Markup Language (XML) as a de facto standard format to represent and exchange different kinds of information. As an illustrative example in Figure 1, an e-Market application system normally handles heterogeneous XML documents and