A Unified Optimization Algorithm For Solving “Regret-Minimizing Representative” Problems

Suraj Shetiya, Abolfazl Asudeh, Sadia Ahmed, Gautam Das
University of Texas at Arlington; University of Illinois at Chicago
{suraj.shetiya@mavs, sadia.ahmed78@mavs, gdas@cse}.uta.edu, asudeh@uic.edu

ABSTRACT

Given a database with numeric attributes, it is often of interest to rank the tuples according to linear scoring functions. For a scoring function and a subset of tuples, the regret of the subset is defined as the (relative) difference in scores between the top-1 tuple of the subset and the top-1 tuple of the entire database. Finding the regret-ratio minimizing set (RRMS), i.e., the subset of a required size k that minimizes the maximum regret-ratio across all possible ranking functions, has been a well-studied problem in recent years. This problem is known to be NP-complete, and there are several approximation algorithms for it. Other NP-complete variants have also been investigated, e.g., finding the set of size k that minimizes the average regret-ratio over all linear functions. Prior work has designed customized algorithms for the different variants of the problem, and these are unlikely to generalize easily to other variants.

In this paper we take a different path towards tackling these problems. In contrast to prior work, we propose a unified algorithm for solving the different problem variants. Unification is achieved by localizing the customization to the design of variant-specific subroutines, or “oracles”, that are called by our algorithm. Our unified algorithm takes inspiration from the seemingly unrelated problem of clustering in data mining and the corresponding K-MEDOID algorithm. We make several innovative contributions in designing our algorithm, drawing on techniques such as linear programming, edge sampling in graphs, volume estimation of multi-dimensional convex polytopes, and several others.
We provide rigorous theoretical analysis, as well as substantial experimental evaluation over real and synthetic data sets, to demonstrate the practical feasibility of our approach.

PVLDB Reference Format:
Suraj Shetiya, Abolfazl Asudeh, Sadia Ahmed and Gautam Das. A Unified Optimization Algorithm For Solving “Regret-Minimizing Representative” Problems. PVLDB, 13(3): 239-251, 2019.
DOI: https://doi.org/10.14778/3368289.3368291

1. INTRODUCTION

Data-driven decision making is challenging when there are multiple criteria to be considered. Consider a database of n tuples with d numeric attributes. In certain cases, “experts” can come up with a (usually linear) function to combine the criteria into a “goodness score” that reflects their preference for the tuples. This function can then be used for ranking and evaluating the tuples [14]. However, devising such a function is challenging [5, 6], and hence not always a reasonable option, especially for ordinary non-expert users [7]. For instance, consider a user who wants to book a hotel in Miami, FL. She wants to find a hotel that is affordable, is close to a beach, and has a good rating. It is not reasonable to expect her to come up with a ranking function, even though she may roughly know what she is looking for. Therefore, she will probably start exploring different options and may end up spending several confusing and frustrating hours before she can finalize her decision.
Alternatively, one could remove the set of “dominated” tuples [8], returning a Pareto-optimal [9] (a.k.a. skyline [7, 8]) set, which is the smallest set guaranteed to contain the “best” choice of the user, assuming that her preference is monotonic [8]. In the case where user preferences are further restricted to linear ranking functions, only the convex hull of the dataset needs to be returned. The problem with the skyline or convex hull is that they can themselves be very large, sometimes constituting a significant portion of the data [10, 11]; hence they lose their appeal as a small representative set for facilitating decision making. Consequently, as outlined in § 7, there has been extensive effort to reduce the size of the set.

Nanongkai et al. [10] came up with the elegant idea of finding a small set that may not contain the absolute “best” tuple for every possible user (ranking function), but is guaranteed to contain a satisfactory choice for each possible function. To do so, they defined the notion of the “regret-ratio” of a representative subset of the dataset for any given ranking function as follows: it is the relative score difference between the best tuple in the database and the best tuple in the representative set. Given k < n, the task is to find the regret-ratio minimizing set (RRMS), i.e., a subset of size k that minimizes the maximum regret-ratio across all possible ranking functions. This problem has been shown to be NP-complete, even for a constant (larger than two) number of criteria (attributes) [12].

Other researchers have also considered different versions of the problem formulation. For instance, Chester et al. [13] generalize the notion of regret from a comparison against the actual top-1 of the database to a comparison against its top-k. More recently, in [14, 15] the goal was to compute the representative set that minimizes the average regret-ratio across all possible functions, instead of the maximum regret-ratio. All these variants have been shown to be NP-complete.
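To make the regret-ratio definition concrete, the following Python sketch (illustrative only, not code from the paper or the cited works) computes the regret-ratio of a subset for one linear scoring function, and estimates the maximum regret-ratio by sampling random non-negative weight vectors; the actual RRMS objective ranges over all linear functions, so sampling only gives a lower-bound estimate of the maximum.

```python
import numpy as np

def regret_ratio(db, subset, w):
    """Regret-ratio of `subset` w.r.t. database `db` under the linear
    scoring function with non-negative weight vector `w`: the relative
    score gap between the top-1 tuple of the full database and the
    top-1 tuple of the subset."""
    best_db = np.max(db @ w)
    best_sub = np.max(subset @ w)
    if best_db <= 0:          # degenerate function; no regret by convention
        return 0.0
    return (best_db - best_sub) / best_db

def max_regret_ratio(db, subset, n_samples=5000, seed=0):
    """Estimate the maximum regret-ratio of `subset` by sampling random
    non-negative weight vectors (a lower bound on the true maximum,
    which ranges over all linear ranking functions)."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(n_samples):
        w = rng.random(db.shape[1])   # non-negative weights in [0, 1)
        worst = max(worst, regret_ratio(db, subset, w))
    return worst
```

For example, for a toy 2-attribute database, keeping the whole database gives maximum regret-ratio 0, while keeping only a single “compromise” tuple yields a strictly positive but bounded regret-ratio.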
Given their intractable nature, there has been significant effort in designing efficient heuristics and approximation algorithms for these problems. The RRMS problem has been investigated in several papers [11, 12, 16], and several approximation algorithms have been designed; the algorithms in [11, 12] run in polynomial time and can approximate the max regret-ratio within any user-specified accuracy threshold. The average regret-ratio problem has been investigated in [14], and a different greedy approach has been pro-
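To illustrate the flavor of such heuristics, here is a naive greedy baseline (an assumption-laden sketch, not one of the algorithms from [11, 12, 14, 16]): it repeatedly adds the tuple that most reduces the maximum regret-ratio, estimated over a fixed sample of random linear scoring functions.

```python
import numpy as np

def greedy_rrms(db, k, n_samples=2000, seed=0):
    """Naive greedy baseline for the RRMS problem (illustrative only).
    Approximates the max regret-ratio over a sample of random linear
    functions and greedily grows a size-k subset of row indices."""
    rng = np.random.default_rng(seed)
    W = rng.random((n_samples, db.shape[1]))       # sampled weight vectors
    scores = db @ W.T                              # (n, n_samples) scores
    best_db = scores.max(axis=0)                   # per-function optimum
    chosen = [int(np.argmax(scores.sum(axis=1)))]  # seed with a high scorer
    while len(chosen) < k:
        cur_best = scores[chosen].max(axis=0)      # subset optimum so far
        # subset optimum per function if each candidate tuple were added
        gains = np.maximum(scores, cur_best[None, :])
        # resulting estimated max regret-ratio for each candidate
        regrets = ((best_db - gains) / best_db).max(axis=1)
        regrets[chosen] = np.inf                   # never re-pick a tuple
        chosen.append(int(np.argmin(regrets)))
    return chosen
```

This sketch runs in O(nk·n_samples) time and carries none of the accuracy guarantees of the cited polynomial-time approximation algorithms; it only conveys the shape of a greedy attack on the objective.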