Lower bounds on the size of selection and rank indexes

Peter Bro Miltersen*

1 Introduction

The rank index problem is the following: Preprocess and store a bit string $x \in \{0,1\}^n$ on a random access machine with word size $w$ so that rank queries “What is $\sum_{i=1}^{j} x_i$?” for arbitrary values of $j$ can afterwards be easily answered. The selection index problem is the following: Preprocess and store a bit string $x \in \{0,1\}^n$ so that selection queries “What is the index of the $j$'th 1-bit in $x$?” for arbitrary values of $j$ can afterwards be easily answered. The data structure representing $x$ should be an index structure, i.e., the $n$-bit string $x$ is kept verbatim in $\lceil n/w \rceil$ words and the preprocessing phase adds an $r$-bit index $\phi(x)$ with additional information contained in $\lceil r/w \rceil$ words. We are interested in tradeoffs between $r$, the size of the index measured in bits (the redundancy of the scheme), and $t$, the worst case time for answering a query.

Upper bounds for the rank and selection index problems were discussed by Jacobson [4], Clark [1], Munro [5], Munro, Raman and Rao [6] and Raman, Raman and Rao [7]. The most relevant case is $w = O(\log n)$ and $t = O(1)$. For these parameters, the best known solutions for both problems have $r = O(n \log\log n/\log n)$ (the upper bound for selection indexes being the most recent one, explicit in the journal version of [7] only).

Here we show lower bounds for the rank and selection index problems in the cell probe model with word size $w$. That is, when measuring the query time, we are charged one unit of time each time we access a $w$-bit register, while computation is for free. Such lower bounds clearly also apply to random access machines with the same word size. We show:

Theorem 1.1. For any index for selection queries using word size $w$ with redundancy $r$ and query time $t$ it holds that $3(r + 2)(tw + 1) \geq n$. Also, for any index for rank queries using word size $w$ with redundancy $r$ and query time $t$ it holds that $2(2r + \log_2(w + 1))\,tw \geq n \log_2(w + 1)$.
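To make the two query types and the bounds of Theorem 1.1 concrete, the following Python sketch may help. It is an illustration only and not part of the paper: the function names are hypothetical, the queries are answered by a naive linear scan with no index at all, and the two inequalities of the theorem are read as lower bounds (left-hand side at least the right-hand side) and solved for $r$.

```python
import math


def rank(x, j):
    """Return x_1 + ... + x_j for a list of bits x (1-based prefix length j)."""
    return sum(x[:j])


def select(x, j):
    """Return the 1-based index of the j'th 1-bit of x, or None if there is none."""
    if j < 1:
        return None
    seen = 0
    for i, bit in enumerate(x, start=1):
        if bit:
            seen += 1
            if seen == j:
                return i
    return None


def selection_redundancy_bound(n, w, t):
    """Smallest integer r >= 0 satisfying 3(r + 2)(tw + 1) >= n (assumed reading)."""
    d = 3 * (t * w + 1)
    return max(0, -(-n // d) - 2)  # -(-n // d) is ceil(n / d) in integer arithmetic


def rank_redundancy_bound(n, w, t):
    """Smallest real r >= 0 satisfying 2(2r + L)tw >= nL, with L = log2(w + 1)."""
    L = math.log2(w + 1)
    return max(0.0, (n * L / (2 * t * w) - L) / 2)
```

For instance, with x = [0, 1, 1, 0, 1], rank(x, 3) is 2 and select(x, 3) is 5. The two bound functions only rearrange the stated inequalities; they say nothing about how an index achieving such redundancy would be built.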
Thus, for the case of $t = O(1)$ and $w = O(\log n)$, we obtain the lower bound $r = \Omega(n/\log n)$ for selection indexes and the lower bound $r = \Omega(n \log\log n/\log n)$ for rank indexes. The latter matches the best known upper bound while the former does not. We leave it as an open problem to get tight bounds for the size of a selection index.

The two proofs are based on two quite different counting arguments, both being rather simple. The proof for selection indexes is related to (but simpler than) the lower bound proof for substring search of Demaine and López-Ortiz [2] and the lower bound for rank indexes is based on the “exposure game” technique of Gál and Miltersen [3].

* Department of Computer Science, University of Aarhus. Supported by BRICS, Basic Research in Computer Science, a centre of the Danish National Science Foundation.

2 Lower bound for selection

Fix a scheme with parameters $n, w, r, t$. Given the form of the lower bound to be proved, we may assume without loss of generality that $w = 1$: a scheme with word size $w$ and query time $t$ is also a scheme with word size 1 and query time $tw$, and the bound depends on $t$ and $w$ only through the product $tw$. We can also assume $t \geq 1$, as this is true in any valid scheme, and $n \geq 12$, as there is otherwise nothing to prove.

Given a bit string $x \in \{0,1\}^n$, we construct another bit string $\tau(x)$ and argue that the index structure $\phi(x)$ together with $\tau(x)$ is sufficient to reconstruct the original data $x$. Let $m = \lfloor (n/3 - 1)/t \rfloor$ and restrict attention to inputs $x$ of Hamming weight $m$. Given $x$, we first construct $\phi(x)$ and then define a string $\tau'(x) \in \{0,1\}^{mt}$ as follows: We run the selection query operation with parameter $j$ for each $j = 1, \ldots, m$ and concatenate the contents of the registers read by these operations. As each query operation inspects $t$ registers, each containing one bit, the total length of $\tau'(x)$ is $mt$ bits. We now obtain $\tau(x)$ by initially setting $\tau(x) = \tau'(x)$ and then erasing certain bits of $\tau(x)$, i.e., setting them to zero, by applying the following two rules:

If $\tau(x)_i$ corresponds to a bit of the index structure $\phi(x)$, we set $\tau(x)_i$ to 0.
If $\tau(x)_i$ and $\tau(x)_k$ with $i < k$ are both copies of the same bit in the input $x$, we let $\tau(x)_k = 0$.

We now observe that if we have access to $\phi(x)$ and $\tau(x)$ and know the query algorithm, we can reconstruct $\tau'(x)$, the concatenated register contents before any bits were erased, without having access to $x$. This is done by simulating the selection queries with parameter $j$ for each $j = 1, \ldots, m$ in increasing order while scanning the string $\tau(x)$ from left to right, thus associating to each bit of $\tau(x)$ an address in the structure $x \cdot \phi(x)$. Each