SliceSort: Efficient Sorting of Hierarchical Data

Quoc Trung Tran
UC Santa Cruz
tqtrung@soe.ucsc.edu

Chee-Yong Chan
National University of Singapore
chancy@comp.nus.edu.sg

The work was done when the author was at National University of Singapore.

ABSTRACT

Sorting is a fundamental operation in data processing. While the problem of sorting flat data records has been extensively studied, there is very little work on sorting hierarchical data such as XML documents. Existing hierarchy-aware approaches create sorted subtrees as initial sorted runs and then merge these subtrees into the sorted output, using either explicit pointers or absolute node key comparisons during merging. In this paper, we propose SliceSort, a novel, level-wise sorting technique for hierarchical data that avoids the drawbacks of subtree-based sorting techniques. Our experimental performance evaluation shows that SliceSort outperforms the state-of-the-art approach, HErMeS, by up to 27%.

Categories and Subject Descriptors

H.2.4 [System]: Query processing

Keywords

Hierarchical Data, SliceSort, Sorting

1. INTRODUCTION

Sorting is a fundamental operation in data processing, and techniques for optimizing the sorting of “flat” data have been extensively studied for both main-memory and external-memory contexts [7, 6]. However, there is very little work on sorting hierarchical data such as XML documents [9, 8]. In a fully sorted hierarchical document, the list of child nodes of every non-leaf node is sorted according to some given criteria (e.g., the key of the child node or some function of the contents of the subtree rooted at the child node). As a simple example, Figures 1(b) and (c) show an unsorted and a sorted hierarchical document, respectively, where the nodes are sorted alphabetically by their key values, which are given by the node labels.
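To make the notion of a fully sorted hierarchical document concrete, the following is a minimal in-memory sketch in Python. It is a naive illustration of the definition only, not SliceSort or HErMeS, and it assumes the whole document fits in main memory. The `Node` class, the function names, and the small sample tree are hypothetical and are not taken from Figure 1.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A node of a hierarchical document (hypothetical type for illustration).

    `key` plays the role of the node label used as the sort key.
    """
    key: str
    children: List["Node"] = field(default_factory=list)


def sort_hierarchy(root: Node) -> Node:
    """Sort the child list of every non-leaf node by key, in place."""
    # An explicit stack avoids Python's recursion limit on deep documents.
    stack = [root]
    while stack:
        current = stack.pop()
        current.children.sort(key=lambda child: child.key)
        stack.extend(current.children)
    return root


def is_fully_sorted(root: Node) -> bool:
    """Check the definition: every child list is in non-decreasing key order."""
    stack = [root]
    while stack:
        current = stack.pop()
        keys = [child.key for child in current.children]
        if keys != sorted(keys):
            return False
        stack.extend(current.children)
    return True


def to_nested(node: Node):
    """Render a (small) tree as nested tuples, for inspection."""
    return (node.key, [to_nested(child) for child in node.children])


# Assumed sample document: children of "/" and of "A" are out of order.
doc = Node("/", [Node("B", [Node("G")]), Node("A", [Node("F"), Node("E")])])
was_sorted_before = is_fully_sorted(doc)
sort_hierarchy(doc)
```

After the call, `to_nested(doc)` lists "A" before "B" under the root and "E" before "F" under "A". Sorting by a function of the subtree contents, as mentioned above, would only change the `key=` argument passed to `sort`.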
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM’12, October 29–November 2, 2012, Maui, HI, USA. Copyright 2012 ACM 978-1-4503-1156-4/12/10 ...$15.00.

Hierarchical sorting has applications in the archival of scientific data, which is predominantly stored in hierarchical data formats. To archive a new data version, an efficient approach is to first sort the new data and then merge it with the existing data version [8, 9].

Example 1. Consider the archival approach proposed in [3] to store multiple versions of hierarchical data. Figure 1(a) shows an example archived document, V1-2, that consists of two data versions. Each node in the document has a node label (e.g., /, A, B) and either an explicit version tag (indicated by “t = ...”) or an implicit version tag. A node’s version tag indicates in which version(s) of the document the node is present; for example, node “A” is present only in version 1, node “B” is present only in version 2, and the root node “/” is present in both versions 1 and 2. A node without an explicit version tag has an implicit tag that is inherited from its parent node. For example, nodes “E” and “F” both inherit the version tag of their parent node “A”, so all three are present only in version 1, while node “G” inherits its version tag from node “B” and appears only in version 2. Note that the document V1-2 is hierarchically sorted based on the lexicographical order of its node labels: the child nodes of the root node are sorted with node “A” preceding node “B”, and the child nodes of node “A” are sorted with node “E” preceding node “F”.
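The inheritance of implicit version tags described for V1-2 can be sketched as follows. This is an illustrative Python sketch, not the archive format of [3]: the `ArchiveNode` type, the use of frozen sets for version tags, and the assumption that node labels are unique (so they can serve as dictionary keys) are all hypothetical choices made for the example. The tree below contains only the nodes named in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional


@dataclass
class ArchiveNode:
    """A node of an archived document (hypothetical type for illustration)."""
    label: str
    # Explicit version tag ("t = ..."); None means the tag is implicit.
    versions: Optional[FrozenSet[int]] = None
    children: List["ArchiveNode"] = field(default_factory=list)


def effective_versions(root: ArchiveNode) -> Dict[str, FrozenSet[int]]:
    """Resolve every node's version tag, inheriting from the parent if implicit."""
    if root.versions is None:
        raise ValueError("the root node needs an explicit version tag")
    resolved: Dict[str, FrozenSet[int]] = {}
    # Each stack entry carries the tag the node would inherit from its parent.
    stack = [(root, root.versions)]
    while stack:
        node, inherited = stack.pop()
        tag = node.versions if node.versions is not None else inherited
        resolved[node.label] = tag  # assumes labels are unique in the document
        for child in node.children:
            stack.append((child, tag))
    return resolved


# The nodes of V1-2 that are named in Example 1.
v1_2 = ArchiveNode("/", frozenset({1, 2}), [
    ArchiveNode("A", frozenset({1}), [ArchiveNode("E"), ArchiveNode("F")]),
    ArchiveNode("B", frozenset({2}), [ArchiveNode("G")]),
])
tags = effective_versions(v1_2)
```

Here `tags["E"]` and `tags["F"]` resolve to version 1 only, and `tags["G"]` resolves to version 2 only, matching the description above.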
Consider a new version of the document, V3 (shown in Figure 1(b)), to be merged into the archived document V1-2. An efficient approach to merge the documents is to first sort V3 into V 3 (shown in Figure 1(c)) and then merge the two documents using a synchronized traversal of the pair of sorted documents [3]. The merged archived document is shown in Figure 1(d).

Another application of hierarchical sorting is change detection in XML documents, which is useful for controlling changes in a warehouse holding a large volume of XML documents [8]. Detecting changes in such an environment serves many purposes, such as versioning, querying the past, and monitoring changes [4, 5, 10]. Earlier works on change detection in XML documents (e.g., [4, 5, 10]) operate on unsorted documents that are assumed to be entirely resident in main memory. However, the state-of-the-art approaches that can operate on large, disk-based data are based on sorted documents [8].

Hierarchical sorting is also useful for processing batch updates to an existing sorted XML document. The idea is to sort the batch of updates and merge it with the existing document [9, 8].

Hierarchical sorting is also useful in the evaluation of the order by clause in XPath [1] and XQuery [2] that allows the