Ultra-succinct Representation of Ordered Trees

Jesper Jansson ∗    Kunihiko Sadakane †    Wing-Kin Sung ‡

∗ Department of Computer Science and Communication Engineering, Kyushu University. Motooka 744, Nishi-ku, Fukuoka 819-0395, Japan. jj@tcslab.csce.kyushu-u.ac.jp Supported by JSPS (Japan Society for the Promotion of Science).
† Department of Computer Science and Communication Engineering, Kyushu University. Motooka 744, Nishi-ku, Fukuoka 819-0395, Japan. sada@csce.kyushu-u.ac.jp Work supported in part by the Grant-in-Aid of the Ministry of Education, Science, Sports and Culture of Japan.
‡ School of Computing, National University of Singapore. ksung@comp.nus.edu.sg

Abstract

There exist two well-known succinct representations of ordered trees: BP (balanced parenthesis) [Munro, Raman 2001] and DFUDS (depth first unary degree sequence) [Benoit et al. 2005]. Both have size 2n + o(n) bits for n-node trees, which asymptotically matches the information-theoretic lower bound. Many fundamental operations on trees can be done in constant time on the word RAM, for example finding the parent, the first child, the next sibling, the number of descendants, etc. However, there has been no single representation supporting every existing operation in constant time; BP does not support i-th child, while DFUDS does not support lca (lowest common ancestor).

In this paper, we give the first succinct tree representation supporting every one of the fundamental operations previously proposed for BP or DFUDS, along with some new operations, in constant time. Moreover, its size surpasses the information-theoretic lower bound and matches the entropy of the tree based on the distribution of node degrees. We call this an ultra-succinct data structure. As a consequence, a tree in which every internal node has exactly two children can be represented in n + o(n) bits. We also show applications for ultra-succinct compressed suffix trees and labeled trees.

1 Introduction

A succinct data structure is a data structure which stores an object using space close to the information-theoretic lower bound, while simultaneously supporting a number of primitive operations on the object in constant time. Here the information-theoretic lower bound for storing an object from a fixed universe with cardinality L is log L bits¹ because in the worst case this number of bits is necessary to distinguish any two distinct objects. For example, the lower bound for a subset of the ordered set {1, 2, ..., n} is n bits because there are $2^n$ different subsets, and the lower bound for an ordered tree with n nodes is 2n - Θ(log n) bits because there exist $\binom{2n-1}{n-1}/(2n-1) = 2^{2n}/\Theta(n^{3/2})$ such trees [19].

Typical succinct data structures are the ones for storing ordered sets [23, 25, 24, 13], ordered trees [14, 19, 8, 9, 2, 22, 21, 3, 28], strings [10, 11, 6, 26, 31, 29], functions [21], cardinal trees [2, 5], etc. The size of a succinct data structure storing an object from the universe is typically (1 + o(1)) log L bits². Many fundamental operations on the object can be done in constant time on the word RAM model with word-length Θ(log n), for example, counting the number of elements in a set which are smaller than a given value, finding the parent of a node in a tree, etc.

This paper considers succinct data structures for ordered trees. Though there exist many such data structures in the literature, they have the following disadvantages.

1. No single succinct data structure supports all fundamental operations in constant time; the balanced parenthesis representation [19, 8] (BP) does not support i-th child, while the depth-first unary degree sequence representation [2, 9] (DFUDS) does not support lowest common ancestor (lca).
2. Though the space is asymptotically optimal in the worst case, it is not optimal for certain classes of trees.
   For example, any n-node tree whose internal nodes have exactly two children can be encoded in n bits by writing 1 for internal nodes and 0 for leaves during the depth-first traversal of the tree, whereas both the BP and the DFUDS use 2n bits.

These drawbacks cause severe problems for document processing. Many huge collections of documents are now available, for example Web pages and genome sequences. To search such documents we use suffix

¹ The base of the logarithm is 2 unless specified. We define 0 log 0 = 0.
² Some papers use a weaker definition of succinctness that allows O(log L) bits.
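The counting bound quoted in the Introduction for ordered trees can be checked numerically. The following is a minimal Python sketch, not part of the paper: the function names `count_ordered_trees` and `lower_bound_bits` are chosen here for illustration only. It evaluates the closed form C(2n-1, n-1)/(2n-1) exactly and prints how far log L falls below 2n, a gap that should grow like Θ(log n).

```python
from math import comb, log2


def count_ordered_trees(n: int) -> int:
    """Number of ordered trees with n >= 1 nodes (illustrative helper)."""
    # Closed form quoted in the text: C(2n-1, n-1) / (2n-1).
    # The division is exact, so integer division loses nothing.
    return comb(2 * n - 1, n - 1) // (2 * n - 1)


def lower_bound_bits(n: int) -> float:
    """Information-theoretic lower bound log2(L) for n-node ordered trees."""
    # math.log2 accepts arbitrarily large Python integers.
    return log2(count_ordered_trees(n))


if __name__ == "__main__":
    for n in (4, 16, 64, 256, 1024):
        bits = lower_bound_bits(n)
        print(f"n={n:5d}  log2(L)={bits:9.2f}  2n - log2(L)={2 * n - bits:6.2f}")
```

For small n the counts are 1, 1, 2, 5, 14 for n = 1, ..., 5, which are the Catalan numbers shifted by one.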
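The n-bit code for trees whose internal nodes all have exactly two children, described in the Introduction, can be sketched as follows. This is an illustrative Python sketch, not the paper's data structure: it only writes and parses the bit string and supports no constant-time navigation. The nested-tuple tree representation (`None` for a leaf, a pair `(left, right)` for an internal node) and the names `encode` and `decode` are assumptions made for this example, and the depth-first traversal is taken to be preorder.

```python
from typing import List, Optional, Tuple

# Assumed representation: a leaf is None, an internal node is (left, right).
Tree = Optional[Tuple["Tree", "Tree"]]


def encode(tree: Tree) -> str:
    """Preorder bit string: '1' for each internal node, '0' for each leaf."""
    bits: List[str] = []
    stack: List[Tree] = [tree]
    while stack:
        node = stack.pop()
        if node is None:
            bits.append("0")
        else:
            bits.append("1")
            left, right = node
            stack.append(right)  # pushed first so the left subtree is visited first
            stack.append(left)
    return "".join(bits)


def decode(bits: str) -> Tree:
    """Rebuild the tree from its preorder bit string (recursive, so deep trees may hit the recursion limit)."""
    pos = 0

    def parse() -> Tree:
        nonlocal pos
        if pos >= len(bits):
            raise ValueError("bit string ended before the tree was complete")
        bit = bits[pos]
        pos += 1
        if bit == "0":
            return None
        left = parse()
        right = parse()
        return (left, right)

    tree = parse()
    if pos != len(bits):
        raise ValueError("trailing bits after a complete tree")
    return tree


if __name__ == "__main__":
    # 4 internal nodes and 5 leaves, so n = 9 nodes in total.
    example: Tree = ((None, None), (None, (None, None)))
    code = encode(example)
    print(code, len(code))
    assert decode(code) == example
```

The example tree has n = 9 nodes and its code has 9 bits, whereas a 2n-bit representation such as BP or DFUDS would spend 18 bits on the same tree.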