The PH-Tree Revisited

Tilmann Zäschke
www.phtree.org
zaeschke@gmx.org
13 February 2016 – Revision 1.1

ABSTRACT
We present several algorithms and improvements for the PH-Tree [14], a general-purpose multi-dimensional index. The algorithms include a skyline query, nearest neighbour queries (kNN queries), range queries and a special update() operator for moving data. The improvements include a solution for the problem of large nodes, resulting in much better scalability with high dimensionality k when updating the tree. Another improvement uses generic data preprocessing to avoid problems with special cases of clustered datasets such as the CLUSTER 0.5 dataset in [14]. Finally, we discuss how the PH-Tree can be used to store hyper-rectangles instead of points and how to perform efficient queries on stored hyper-rectangles.

1. INTRODUCTION
The PH-Tree¹ is a multi-dimensional index that belongs to the family of quadtrees but provides much better space efficiency and scalability with higher dimensionality. The PH-Tree also scales very well with large datasets; in some cases, larger datasets with N > 10^6 actually perform better than smaller ones. The PH-Tree also provides implicit Z-ordering by combining its quadtree features with the approach of interleaving bits, as is done in critbit trees (also known as binary prefix tries). However, the PH-Tree goes beyond a simple combination of quadtrees and critbit trees by using unique and efficient algorithms for multi-dimensional hypercube navigation, which allow processing of up to 64 dimensions in parallel on CPUs with 64-bit registers.

The contributions of this paper are as follows:
• Discussion of the tree's properties.
• Solution to the insertion performance problem for increasing k.
• Solution to performance problems with special cases of clustered data with k ≥ 10.
• kNN queries, range queries and moving objects.
• Storage and querying of rectangle shapes.
• Performance tuning recommendations.
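The bit interleaving behind the implicit Z-ordering mentioned in the introduction can be illustrated with a small sketch. This is not the PH-Tree implementation; the function name `interleave` and the fixed bit width are assumptions chosen for illustration, and the sketch only shows how interleaving the bits of k coordinates yields a single Z-order key.

```python
def interleave(point, width):
    """Interleave the bits of all coordinates, most significant bit first.

    `point` is a tuple of k non-negative integers, each fitting in `width` bits.
    The result is a single integer whose natural order is the Z-order.
    (Hypothetical helper for illustration, not part of the PH-Tree API.)
    """
    z = 0
    for bit in range(width - 1, -1, -1):   # from the highest bit down to bit 0
        for value in point:                # one bit from every dimension in turn
            z = (z << 1) | ((value >> bit) & 1)
    return z


# Sorting 2-bit, 2-dimensional points by their interleaved key gives Z-order.
points = [(3, 3), (2, 0), (1, 1), (0, 1), (1, 0), (0, 0)]
z_sorted = sorted(points, key=lambda p: interleave(p, width=2))
```

At each bit position, the k bits taken from the k coordinates together select one of the 2^k quadrants, which is how the interleaved key relates to a quadtree-style subdivision.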
¹ The complete source code is available at http://www.phtree.org
Copyright is held by Tilmann Zäschke. License: CC BY 3.0 (http://creativecommons.org/licenses/by/3.0/)
http://dx.doi.org/10.13140/RG.2.1.1594.0567

The PH-Tree was originally published in [14]. In Section 2, we recapitulate the PH-Tree as presented in the original paper. Section 3 first discusses general properties of the PH-Tree and then explains some of its non-intuitive behaviour. Section 4 discusses structural improvements to the original version that solve, for example, the problem of slow updates with increasing k. In Section 5 we explain how data preprocessing can be used to avoid performance problems with special cases of clustered data. In Section 6 we explain new algorithms, such as kNN queries, range queries and update(). Section 7 explains how the tree can be used efficiently to store rectangles instead of points. Section 8 gives suggestions for getting the best performance out of the PH-Tree. Section 9 discusses open questions and possible improvements. Finally, we give some concluding remarks in Section 10.

1.1 Related Work
One resource for related work is the original PH-Tree publication [14]. Additional work on data preprocessing for the PH-Tree is available in [3]. Parallelization and cluster computing with the PH-Tree have been explored in [13]. In addition, we would like to point out the very useful ELKI framework [7, 10] for research in k-dimensional data mining.

1.2 Terminology
Most terminology is described in [14]. However, some terminology has been updated or added:

Range Queries & Window Queries.
The original paper uses the term range query for queries on a (hyper-)rectangular section of the data. This conflicts with other definitions, such as in [7, 10], which use the same term for queries on a (hyper-)spherical section of the data, i.e. everything within a given (Euclidean) range.
In this paper we use window query for queries on (hyper-)rectangles and range query for queries on (hyper-)spheres.

HC, AHC, LHC, BHC, NI.
The original paper distinguished two possible data representations in a node: HC (HyperCube) and LHC (Linearized HyperCube). We now define three representations:
• AHC: Array HyperCube (formerly HC)
• LHC: Linearized HyperCube
• BHC: Binary Hypercube (also sometimes called NI), see Section 4.1
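To make the distinction between window queries and range queries from this section concrete, the following brute-force sketch applies both definitions to a small point set. It does not use the PH-Tree at all; the function names `in_window` and `in_range` and the sample data are assumptions made for illustration only.

```python
import math


def in_window(point, lower, upper):
    """Window query predicate: point lies inside the axis-aligned (hyper-)rectangle."""
    return all(lo <= x <= hi for x, lo, hi in zip(point, lower, upper))


def in_range(point, center, radius):
    """Range query predicate: point lies within Euclidean distance `radius` of `center`."""
    return math.dist(point, center) <= radius


points = [(1, 1), (2, 2), (1, 2), (3, 1)]

# Window query on the rectangle with corners (0, 0) and (2, 2).
window_result = [p for p in points if in_window(p, (0, 0), (2, 2))]

# Range query on the circle with centre (1, 1) and radius 1.2.
range_result = [p for p in points if in_range(p, (1, 1), 1.2)]
```

In this example the corner point (2, 2) is returned by the window query but not by the range query, because it lies inside the rectangle yet farther than 1.2 from the centre.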