Multimedia Systems (2007) 12:533–550
DOI 10.1007/s00530-006-0070-9

REGULAR PAPER

MKL-tree: an index structure for high-dimensional vector spaces

Annalisa Franco · Alessandra Lumini · Dario Maio

Published online: 9 November 2006
© Springer-Verlag 2006

Abstract In this work, a novel hierarchical data structure for high-dimensional data indexing is proposed. MKL-tree is based on dimensionality reduction operated by means of the MKL transform, a multi-space generalization of the KL transform. A local dimensionality reduction is performed at each node of the tree, allowing more selective features to be extracted and thus increasing the discriminating power of the index. The mathematical foundations for the representation of nodes and leaves and for the techniques used to manage the structure are detailed. Moreover, the algorithms for bulk loading MKL-tree (i.e., for creating the tree given a large number of objects simultaneously), for updating and splitting nodes after the insertion of new objects, and for performing similarity searches are described. Results are reported for the comparison of MKL-tree with other well-known access methods in terms of I/O and CPU costs and precision of the result in the execution of similarity queries.

Keywords High-dimensional data · Index structures · Similarity search · Dimensionality reduction

A. Franco (✉) · A. Lumini
Corso di Laurea in Scienze dell’Informazione, Università di Bologna, via Sacchi 3, 47023, Cesena, Italy
e-mail: franco@csr.unibo.it

A. Lumini
e-mail: lumini@csr.unibo.it

D. Maio
DEIS - CSITE-CNR - Università di Bologna, viale Risorgimento 2, 40136, Bologna, Italy
e-mail: dmaio@deis.unibo.it

1 Introduction

Similarity search in multidimensional databases is a problem widely discussed in the literature [9, 34], and a variety of data structures [6, 20, 37] for indexing vector spaces have been proposed, where objects are usually represented as feature vectors belonging to high-dimensional spaces and are searched by similarity to a given example. Many of these structures work well at low to medium dimensionality but, as a consequence of the phenomenon known as the “curse of dimensionality” [3], they are often outperformed by a simple linear scan for dimensionality above 20–30.

This problem is usually dealt with by applying a dimensionality reduction technique: the data to be indexed are first reduced to a lower dimensionality by means of the Karhunen–Loève (KL) transform [19, 23] and then indexed with a traditional data structure. This approach, usually referred to as global dimensionality reduction (GDR), works well when the dataset is static and globally correlated. This assumption does not usually hold in real applications, where GDR produces an excessive loss of information and, as a consequence, poor query performance. Recently, new techniques have been proposed to deal with these problems: a novel method for performing SVD-based dimensionality reduction in dynamic databases [25] and a local reduction technique, named local dimensionality reduction (LDR) [13]. LDR consists of an indexing structure based on partitioning the data into locally correlated subsets, each of which is projected into the KL subspace associated with its elements and indexed independently of the others by a traditional structure (Hybrid-tree [14] is suggested). LDR outperforms GDR for locally correlated datasets; however, it requires a representative set of data to be