Skyline Query Processing for Incomplete Data

Mohamed E. Khalefa, Mohamed F. Mokbel, Justin J. Levandoski
Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN
{khalefa,mokbel,justin}@cs.umn.edu

Abstract: Recently, there has been much interest in processing skyline queries for various applications that include decision making, personalized services, and search pruning. Skyline queries aim to prune a search space of large numbers of multi-dimensional data items to a small set of interesting items by eliminating items that are dominated by others. Existing skyline algorithms assume that all dimensions are available for all data items. This paper goes beyond this restrictive assumption and addresses the more practical case of incomplete data items (i.e., data items missing values in some of their dimensions). In contrast to the case of complete data, where the dominance relation is transitive, incomplete data suffer from a non-transitive dominance relation, which may lead to cyclic dominance behavior. We first propose two algorithms, namely, “Replacement” and “Bucket”, that use traditional skyline algorithms for incomplete data. Then, we propose the “ISkyline” algorithm, which is designed specifically for the case of incomplete data. The “ISkyline” algorithm employs two optimization techniques, namely, virtual points and shadow skylines, to tolerate cyclic dominance relations. Experimental evidence shows that the “ISkyline” algorithm significantly outperforms variations of traditional skyline algorithms.

I. INTRODUCTION

Given a search space of $D$ independent dimensions, $u_1, u_2, \cdots, u_D$, a point $p_i$ is said to dominate another point $p_j$ if the value of $p_i.u_k$ is better than or equal to that of $p_j.u_k$ over all dimensions $1 \le k \le D$, and there is at least one dimension $l$ in which $p_i$ is strictly better, i.e., $p_i.u_l > p_j.u_l$.
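The dominance definition above can be sketched in a few lines of Python. This is only an illustrative sketch for complete data: the function name `dominates` is hypothetical, and it assumes that larger values are "better" in every dimension, following the strict inequality used in the definition.

```python
from typing import Sequence


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """Return True if p dominates q on complete data.

    Assumption (not from the paper): larger values are better
    in every dimension.
    """
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    # p must be at least as good as q in every dimension ...
    no_worse = all(a >= b for a, b in zip(p, q))
    # ... and strictly better in at least one dimension.
    strictly_better = any(a > b for a, b in zip(p, q))
    return no_worse and strictly_better
```

For attributes where smaller values are preferable, such as price or distance, one option is to negate the values before comparing.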
A skyline query over a set $S$ of $D$-dimensional points aims to find a set of points $S_{sky} \subseteq S$ where any point $p_{sky} \in S_{sky}$ is not dominated by any point in $S$, while each point $p_i \in S - S_{sky}$ is dominated by some point in $S$. In general, a skyline query reduces the search space $S$ to only the set of skyline points $S_{sky}$ that are of interest to the user.

Skyline queries are widely applicable to multi-criteria decision making applications. For example, consider the classical scenario where a user wants to reserve a hotel that is both close to the conference site and inexpensive, chosen from a large set of hotels. A hotel $h_i$ is represented as a two-dimensional point $(d_i, r_i)$, where $d_i$ and $r_i$ represent the distance and price of the hotel, respectively. Rather than examining the whole space of hotels, a skyline query eliminates any hotel $h_j$ for which there is another hotel $h_k$ that is both cheaper and closer to the conference site than $h_j$. Another example of skyline queries is a movie rating application (e.g., MovieLens [1]) in which $D$ system users rank various movies. In this case, each movie is represented as a $D$-dimensional point where each dimension corresponds to a certain user. When searching for the best movie, a skyline query eliminates those movies for which all users agree there exists at least one other superior (i.e., overall better-ranked) movie.

This work is supported in part by the Grant-in-Aid of Research, Artistry, and Scholarship, University of Minnesota, DTC Digital Technology Initiative Program, University of Minnesota, and DTC Intelligent Storage Consortium (DISC), University of Minnesota.

Due to the importance of skyline queries, several research efforts have been dedicated to developing efficient skyline query processors (e.g., see [2], [3], [4], [5], [6], [7]). Almost all of these algorithms rely mainly on two implicit assumptions:

(1) Data are complete, i.e., all dimensions are available for all data items.
Such an assumption of completeness is not practical in many cases. For example, consider the movie rating application [1] with hundreds of users rating thousands of movies. It is highly unlikely that every single user will rate all movies. Instead, a user will rate only the movies that interest her. As a result, each movie will be represented as a $D$-dimensional point with several blank (i.e., incomplete) dimensions. Another example is from the hotel application, where some hotels may not disclose some of their properties. These undisclosed properties are represented as incomplete entries within the hotel's multi-dimensional point representation.

(2) With the exception of [2], all skyline algorithms assume transitivity in the dominance relation, i.e., if data item $p_i$ dominates $p_j$ while $p_j$ dominates $p_k$, then $p_i$ dominates $p_k$. Using the transitivity property, skyline query processing algorithms exploit various ways of data pruning and indexing. Unfortunately, as will be seen in this paper, the transitive dominance relation is not applicable to the case of incomplete data.

In this paper, we go beyond the completeness assumption on multi-dimensional input data and develop new algorithms for efficient computation of skyline queries over incomplete data sets. A new set of algorithms is needed for incomplete data mainly because the transitive dominance relation no longer holds. For example, we could have three data items $p_i$, $p_j$, and $p_k$, where $p_i$ dominates $p_j$, $p_j$ dominates $p_k$, while $p_k$ dominates $p_i$. In this case, we are not only missing the transitive dominance relation, as $p_i$ does not dominate $p_k$, but we also face another problem: a cyclic dominance relation among $p_i$, $p_j$, and $p_k$. Under this cyclic dominance relation, none of these three points can be considered a skyline point, as each point is dominated by at least one other point.
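The cycle described above can be reproduced with a small sketch. It assumes that dominance between two incomplete points is evaluated only over the dimensions for which both points have values, a rule that this part of the paper has not yet stated. Missing values are encoded as `None`, larger values are assumed to be better, and the names `dominates_incomplete` and `naive_skyline` as well as the three sample points are hypothetical.

```python
from typing import List, Optional, Sequence

Point = Sequence[Optional[float]]


def dominates_incomplete(p: Point, q: Point) -> bool:
    """Dominance restricted to dimensions known in both points.

    Assumptions (illustrative only): None marks a missing value,
    and larger values are better.
    """
    common = [(a, b) for a, b in zip(p, q)
              if a is not None and b is not None]
    if not common:
        return False  # no shared dimension: the points are incomparable
    return (all(a >= b for a, b in common)
            and any(a > b for a, b in common))


def naive_skyline(points: List[Point]) -> List[Point]:
    """Keep every point that no other point dominates (quadratic scan)."""
    return [p for p in points
            if not any(dominates_incomplete(q, p) for q in points)]


# Each pair of points shares exactly one known dimension.
p_i = (2, None, 1)
p_j = (1, 2, None)
p_k = (None, 1, 2)

print(dominates_incomplete(p_i, p_j))  # compared on dimension 1 only
print(dominates_incomplete(p_j, p_k))  # compared on dimension 2 only
print(dominates_incomplete(p_k, p_i))  # compared on dimension 3 only
print(naive_skyline([p_i, p_j, p_k]))  # every point is dominated
```

With these sample points, each point dominates the next one around the cycle, so the naive pairwise scan returns an empty skyline. The scan also cannot rely on transitivity to discard points early, which is the pruning that traditional algorithms depend on.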
We start by introducing two variations of traditional skyline algorithms that accommodate incomplete data, namely, the Replacement and Bucket algorithms. Then, we introduce the ISkyline algorithm as a specialized algorithm for