Using Compressed B+-trees for Line-based Database Indexes

Hung-Yi Lin 1 and Chin-Ling Chen 2

1. Department of Finance, Chaoyang University of Technology, No. 168, Jifong E. Rd., Wufong Township, Taichung County 41349, Taiwan (R.O.C.), linhy@mail.cyut.edu.tw
2. Department of Computer Science and Information Engineering, Chaoyang University of Technology, No. 168, Jifong E. Rd., Wufong Township, Taichung County 41349, Taiwan (R.O.C.), clc@mail.cyut.edu.tw

Abstract-In this paper, we propose a new indexing method called the compressed B+-tree. The traditional B+-tree is the most common dynamic index structure in database systems. However, in practical applications, its performance still leaves considerable room for improvement. The compressed B+-tree outperforms traditional access methods in two respects. The first is a more economical storage requirement for the indexing structure. The second is better retrieval performance. In addition, a compressed B+-tree can be used to implement line-based database indexes.

Keywords-B+-trees, compressed B+-trees, index structure, multi-dimensional databases, splitting policy.

1. INTRODUCTION

Balanced trees are an essential data structure for implementing database indexes because they offer rapid access to any record in the database. With an increasing number of computer applications relying heavily on multi-dimensional data [4, 5, 13, 15], the database community has devoted considerable attention to multi-dimensional database management, and in particular to the storage and retrieval of multi-dimensional data.

The performance of a database index depends on its hierarchical structure. The hierarchical construction of an index structure relies on three factors. The first is the distribution of the indexed data in the space. In most practical applications, the data are skewed and non-uniform. The second factor is the data insertion order.
As is generally known, inserting the same data in different orders results in different hierarchical directories. The third factor is the data insertion algorithm. A poor algorithm may be unable to organize the indexed data into a well-organized structure. The data insertion algorithms in much of the literature [3, 11, 12, 14] suffer from their splitting policies. In fact, an improper splitting policy degrades the hierarchical directory of an index structure. We propose a better insertion mechanism to reduce and even eliminate these harmful effects on the indexing structure.

2. BACKGROUND

Considerable work has been devoted to the appropriate organization of trees for indexes. B+-trees [1, 2, 3, 9] are the most common dynamic index structures in database systems. However, much less has been reported on the appropriate organization of the keys within each tree node, even though this organization can have a considerable impact on the total cost. The key design issue in this paper is to reconsider how entries are arranged between a full target leaf and its siblings before a split is performed.

Since the amount of data in a multi-dimensional database tends to be large, system performance is usually crucial. Two criteria, space efficiency and time efficiency, are generally used to evaluate system performance. Space efficiency has two parts: storage requirement and storage utilization. The system storage requirement (denoted by SSR) measures the total amount of storage space needed to hold the whole index structure. That is,

\[ \mathrm{SSR} = (\text{total number of allocated nodes}) \times (\text{maximum node size}) \]

Suppose one entry occupies k bytes; then the practical memory demand for an index structure is k×SSR bytes. The system storage utilization (denoted by SSU) measures how fully each tree node is occupied. Without loss of generality, we define SSU as follows.

\[ \mathrm{SSU} = \frac{\text{total number of indexed entries}}{\mathrm{SSR}} \times 100\% \]

The more compact the tree construction is, the higher the SSU will be.
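As a minimal sketch of the two definitions above, the following Python snippet computes SSR, the byte demand k×SSR, and SSU. The function names and the sample figures (1,000 entries, a node capacity of 10 entries, 16 bytes per entry, and the two node counts) are illustrative assumptions, not values taken from the paper.

```python
def ssr(allocated_nodes: int, max_node_size: int) -> int:
    """System storage requirement, in entries:
    (total number of allocated nodes) x (maximum node size)."""
    return allocated_nodes * max_node_size


def index_bytes(entry_bytes: int, allocated_nodes: int, max_node_size: int) -> int:
    """Practical memory demand k x SSR, where k is the size of one entry in bytes."""
    return entry_bytes * ssr(allocated_nodes, max_node_size)


def ssu(indexed_entries: int, allocated_nodes: int, max_node_size: int) -> float:
    """System storage utilization, as a percentage:
    (total number of indexed entries / SSR) x 100."""
    return 100.0 * indexed_entries / ssr(allocated_nodes, max_node_size)


# Hypothetical example: the same 1,000 entries stored in two indexes
# whose nodes hold at most 10 entries each.
loose_ssu = ssu(1000, 150, 10)    # 150 allocated nodes, about 66.7 %
compact_ssu = ssu(1000, 110, 10)  # 110 allocated nodes, about 90.9 %

# Assuming each entry occupies k = 16 bytes.
loose_bytes = index_bytes(16, 150, 10)    # 24000 bytes
compact_bytes = index_bytes(16, 110, 10)  # 17600 bytes
```

In this hypothetical case the index with fewer allocated nodes has both the higher SSU and the lower SSR, because the number of indexed entries and the node size are held fixed.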
Nevertheless, compactness does not guarantee economy. A space-efficient index structure should maximize its SSU and minimize its SSR.

Time efficiency has two major parts: system maintenance performance (insertion and deletion operations) and data retrieval performance (query and search operations). Whichever operation is applied, the time cost is directly proportional to the depth of the index. A deeper index implies more nodes, and therefore more disk blocks are involved in data retrieval. As a result, a deeper index has poorer time efficiency.

2006 IEEE International Symposium on Signal Processing and Information Technology
0-7803-9754-1/06/$20.00 ©2006 IEEE

3. COMPRESSED B+-TREES