DoS: An Efficient Scheme for the Diversification of Multiple Search Results

Hina A. Khan
School of ITEE, The University of Queensland, Queensland, Australia
h.khan3@uq.edu.au

Marina Drosou *
Computer Science Dept., University of Ioannina, Ioannina, Greece
mdrosou@cs.uoi.gr

Mohamed A. Sharaf
School of ITEE, The University of Queensland, Queensland, Australia
m.sharaf@uq.edu.au

ABSTRACT
Data diversification provides users with a concise and meaningful view of the results returned by search queries. In addition to taming the information overload, data diversification also reduces data communication costs and enables data exploration. The explosion of big data emphasizes the need for data diversification in modern data management platforms, especially for applications based on web, scientific, and business databases. Achieving effective diversification, however, is a rather challenging task due to the inherent high processing costs of current data diversification techniques. This challenge is further accentuated in a multi-user environment, in which multiple search queries are to be executed and diversified concurrently. In this paper, we propose the DoS scheme, which addresses the problem of scalable diversification of multiple search results. Our experimental evaluation shows the scalability exhibited by DoS under various workload settings, and the significant benefits it provides compared to sequential methods.

1. INTRODUCTION
Users interact with a number of applications by submitting queries to satisfy their information needs. Today, the explosion of the available data in many web, scientific and business domains dictates the need for the development of effective methods to assist users in quickly locating results of high interest. Search result diversification is one such method and has been widely employed for this purpose [3].
In particular, search result diversification aims to select an interesting representative subset of the available results.

Realizing an effective diversification system, however, is a rather challenging task. This is primarily due to the inherent high processing costs of current data diversification algorithms. This challenge is further complicated in a multi-user environment, in which multiple search queries are to be executed and diversified concurrently. For instance, in the Sloan Digital Sky Survey (SDSS) science database¹, the SDSS Catalog Archive Server (CAS) employs a multi-server, multi-queue batch job submission and tracking system [7]. From a system perspective, this leads to the simultaneous execution of a large number of queries that are essentially of exploratory nature [2]. Such an environment highlights the need for a scalable diversification system that is able to effectively facilitate data exploration tasks.

* Work done while author was visiting the School of ITEE at the University of Queensland.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
SSDBM '13, July 29-31, 2013, Baltimore, MD, USA
Copyright 2013 ACM 978-1-4503-1921-8/13/07 $15.00

In this paper, we propose the DoS (Diversification of Multiple Search Results) scheme that addresses the problem of efficiently diversifying the results of multiple queries.
Towards this goal, DoS leverages the natural overlap in search results in conjunction with the concurrent diversification of those overlapping results. This enables DoS to provide the same quality of diversification as that of the sequential methods, while significantly reducing the processing costs. Our experimental evaluation on both real and synthetic data sets shows the scalability exhibited by DoS under various workload settings, and the significant benefits it provides compared to sequential methods.

Roadmap: We define search result diversification in Section 2. We introduce multiple search result diversification and our DoS scheme in Section 3. The evaluation results are reported in Section 4, and we conclude in Section 5.

2. BACKGROUND AND RELATED WORK
In this work, we consider queries that retrieve a number of results, or items, from the database. Database results can be generally viewed as a set of attribute values represented as data points in a multi-dimensional data space. In such systems, a query is first processed against the stored database, and the generated results are then further processed in-memory through a diversification system.

Let $X = \{x_1, \ldots, x_m\}$ be a set of results for some user query. In general, the goal of result diversification is to select a subset $S^*$ of $X$ with $|S^*| = k$, $k \le m$, such that the diversity of the results in $S^*$ is maximized.

¹ http://www.sdss.org

In this work, we primarily focus on the widely used content-based definition of diversity. Content diversity is an instance