Zurich Open Repository and Archive
University of Zurich
Main Library
Strickhofstrasse 39
CH-8057 Zurich
www.zora.uzh.ch

Year: 2011

Efficient top-k approximate subtree matching in small memory

Augsten, Nikolaus; Barbosa, Denilson; Böhlen, Michael H; Palpanas, Themis

Abstract: We consider the Top-k Approximate Subtree Matching (TASM) problem: finding the k best matches of a small query tree within a large document tree using the canonical tree edit distance as a similarity measure between subtrees. Evaluating the tree edit distance for large XML trees is difficult: the best known algorithms have cubic runtime and quadratic space complexity and thus do not scale. Our solution is TASM-postorder, a memory-efficient and scalable TASM algorithm. We prove an upper bound for the maximum subtree size for which the tree edit distance needs to be evaluated. The upper bound depends on the query and is independent of the document size and structure. A core problem is to efficiently prune subtrees that are above this size threshold. We develop an algorithm based on the prefix ring buffer that allows us to prune all subtrees above the threshold in a single postorder scan of the document. The size of the prefix ring buffer is linear in the threshold. As a result, the space complexity of TASM-postorder depends only on k and the query size, and the runtime of TASM-postorder is linear in the size of the document. Our experimental evaluation on large synthetic and real XML documents confirms our analytic results.

DOI: https://doi.org/10.1109/TKDE.2010.245

Posted at the Zurich Open Repository and Archive, University of Zurich
ZORA URL: https://doi.org/10.5167/uzh-56414
Journal Article
Accepted Version

Originally published at:
Augsten, Nikolaus; Barbosa, Denilson; Böhlen, Michael H; Palpanas, Themis (2011). Efficient top-k approximate subtree matching in small memory. IEEE Transactions on Knowledge and Data Engineering, 23(8):1123-1137.
DOI: https://doi.org/10.1109/TKDE.2010.245