The Hybrid-Layer Index: A Synergic Approach to Answering Top-k Queries in Arbitrary Subspaces

Jun-Seok Heo†, Junghoo Cho‡, and Kyu-Young Whang†
†Department of Computer Science, KAIST, Korea, {jsheo,kywhang}@mozart.kaist.ac.kr
‡University of California, Los Angeles, USA, cho@cs.ucla.edu

Abstract—In this paper, we propose the Hybrid-Layer Index (simply, the HL-index), which is designed to answer top-k queries efficiently when the queries are expressed on an arbitrary subset of the attributes in the database. Compared to existing approaches, the HL-index significantly reduces the number of tuples accessed during query processing by pruning unnecessary tuples based on two criteria: it filters out tuples both (1) globally, based on the combination of all attribute values of the tuples, as in the layer-based approach (simply, layer-level filtering) and (2) based on the individual attribute values specifically used for ranking the tuples, as in the list-based approach (simply, list-level filtering). Specifically, the HL-index exploits the synergic effect of integrating the layer-level filtering method and the list-level filtering method. Through an in-depth analysis of the interaction of the two filtering methods, we derive a tight bound that reduces the number of tuples retrieved during query processing while guaranteeing correct query results. We propose the HL-index construction and retrieval algorithms and formally prove their correctness. Finally, we present experimental results on synthetic and real datasets comparing the performance of the HL-index to other state-of-the-art indexes. Our experiments demonstrate that the HL-index shows the best (or close to best) performance in most scenarios regardless of the size of the dataset, the number of attributes in the tuples, and the number of attributes used in the queries.

I. INTRODUCTION

Computing top-k answers quickly is becoming ever more important as the size of databases grows and as more users access data through interactive interfaces. When a database is large, it may take minutes (if not hours) to compute the complete answer to a query if the query matches millions of the tuples in the database. Most users, however, are interested in looking at just the top few results (ranked by a small set of attribute values that the users are interested in), and they want to see the results immediately after they issue the query.

As an example, consider a database of digital cameras, which has many attributes such as price, manufacturer, model number, weight, size, pixel count, sensor size, etc. Among these attributes, a particular user is likely to be interested in only a small subset when deciding what to purchase. For example, a user who wants to buy a cheap compact digital camera will be mainly interested in the price and the weight and may issue a query like SELECT * FROM Cameras ORDER BY 0.5*price+0.5*weight ASC LIMIT k. Another user who primarily cares about the quality of the pictures will be more interested in the pixel count and sensor size and issue a query like SELECT * FROM Cameras ORDER BY 0.4*pixelCount+0.6*sensorSize DESC LIMIT k. To handle scenarios like the above, we propose the Hybrid-Layer Index (simply, the HL-index), which is designed to answer top-k queries on an arbitrary subset of the attributes efficiently.

There exist a number of approaches for efficient computation of top-k answers. For example, in their seminal work, Fagin et al. [10], [11] designed a series of algorithms that consider a tuple as a potential top-k answer only if the tuple is ranked high in at least one of the attributes used for ranking. We refer to this approach as the list-based approach because the algorithms require maintaining one sorted list per attribute.
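To make the list-based idea concrete, the following is a minimal sketch of a threshold-style algorithm in the spirit of Fagin et al. [10], [11], applied to the first camera query above. It is not the HL-index and not the authors' implementation. The function name `threshold_topk`, the in-memory dictionary layout, and the sample values (assumed to be normalized to [0, 1]) are hypothetical choices made for illustration. The sketch assumes non-negative weights and a minimizing query, so that the weighted sum of the values last read from each sorted list is a lower bound on the score of every unseen tuple.

```python
import heapq

def threshold_topk(tuples, weights, k):
    """Return (top-k list, number of distinct tuples accessed).

    tuples:  dict mapping tuple id -> dict of attribute -> value
    weights: dict mapping queried attribute -> non-negative weight
    Lower weighted sums rank higher (ORDER BY ... ASC).
    """
    # One sorted list per queried attribute (ascending, best value first).
    lists = {a: sorted(tuples, key=lambda t, a=a: tuples[t][a]) for a in weights}

    def score(tid):
        # "Random access": look up all queried attributes of one tuple.
        return sum(w * tuples[tid][a] for a, w in weights.items())

    seen = set()
    best = []  # max-heap via negated scores; holds the current k best tuples
    for depth in range(len(tuples)):
        last = {}
        for a, lst in lists.items():
            tid = lst[depth]              # "sorted access" at this depth
            last[a] = tuples[tid][a]
            if tid not in seen:
                seen.add(tid)
                heapq.heappush(best, (-score(tid), tid))
                if len(best) > k:
                    heapq.heappop(best)   # drop the worst of the k+1 candidates
        # No unseen tuple can score below this threshold.
        threshold = sum(w * last[a] for a, w in weights.items())
        if len(best) == k and -best[0][0] <= threshold:
            break
    ranked = sorted((-neg, tid) for neg, tid in best)
    return [(tid, s) for s, tid in ranked], len(seen)

# Hypothetical, normalized sample data.
cameras = {
    "A": {"price": 0.1, "weight": 0.9, "pixels": 0.5},
    "B": {"price": 0.2, "weight": 0.3, "pixels": 0.4},
    "C": {"price": 0.4, "weight": 0.2, "pixels": 0.9},
    "D": {"price": 0.8, "weight": 0.1, "pixels": 0.2},
    "E": {"price": 0.9, "weight": 0.8, "pixels": 0.1},
}
top, accessed = threshold_topk(cameras, {"price": 0.5, "weight": 0.5}, k=2)
```

In this sample, tuples A and D are each ranked first in one list, so the sketch must read and score them even though neither ends up in the top-2 answer.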
While this approach shows significant improvement compared to earlier work, it often considers an unnecessarily large number of tuples. For instance, when a tuple is ranked high in one attribute but low in all others, the tuple is likely to be ranked low in the final answer and can potentially be ignored, but the list-based approach has to consider it because of its high rank in that one attribute. As the size of the database grows, this becomes an acute problem because there are likely to be more tuples that are ranked high in one attribute but low overall.

To avoid this pitfall, Chang et al. [7] proposed an algorithm that constructs a global index based on the combination of all attribute values and uses this index for top-k answer computation. We refer to this approach as the layer-based approach because it builds an index that partitions the tuples into multiple layers. The layer-based approach avoids the pitfall of Fagin's algorithms, but it has the opposite problem. Because the index is constructed on all attributes, it does not perform well when the query ranks tuples by a small subset of the attributes. A tuple may be ranked high globally on many attributes, but it may be ranked low for the particular subset of attributes used in a query.

One simple way to address the drawback of the layer-based approach is to build one dedicated index for every subset of attributes and use the appropriate index for a query, as in [9], [14]. We refer to these approaches as the view-based approach. Clearly, view-based approaches lead to high query performance, but they also incur significant space overhead.

Our proposed HL-index tries to avoid all pitfalls of the existing approaches in the following ways. By careful integration of the list-based and the layer-based approaches, it is able to filter out a tuple both by the global combination of all of its