DoS: An Efficient Scheme for the Diversification of Multiple Search Results

Hina A. Khan
School of ITEE, The University of Queensland, Queensland, Australia
h.khan3@uq.edu.au

Marina Drosou*
Computer Science Dept., University of Ioannina, Ioannina, Greece
mdrosou@cs.uoi.gr

Mohamed A. Sharaf
School of ITEE, The University of Queensland, Queensland, Australia
m.sharaf@uq.edu.au

ABSTRACT
Data diversification provides users with a concise and meaningful view of the results returned by search queries. In addition to taming the information overload, data diversification also reduces data communication costs and enables data exploration. The explosion of big data emphasizes the need for data diversification in modern data management platforms, especially for applications based on web, scientific, and business databases. Achieving effective diversification, however, is a rather challenging task due to the inherent high processing costs of current data diversification techniques. This challenge is further accentuated in a multi-user environment, in which multiple search queries are to be executed and diversified concurrently. In this paper, we propose the DoS scheme, which addresses the problem of scalable diversification of multiple search results. Our experimental evaluation shows the scalability exhibited by DoS under various workload settings, and the significant benefits it provides compared to sequential methods.

1. INTRODUCTION
Users interact with a number of applications by submitting queries to satisfy their information needs. Today, the explosion of the available data in many web, scientific and business domains dictates the need for the development of effective methods to assist users in quickly locating results of high interest. Search result diversification is one such method that has been widely employed to assist users in that direction [3].
In particular, search result diversification selects an interesting representative subset of the available results.

Realizing an effective diversification system, however, is a rather challenging task. This is primarily due to the inherent high processing costs of current data diversification algorithms. This challenge is further complicated in a multi-user environment, in which multiple search queries are to be executed and diversified concurrently. For instance, in the Sloan Digital Sky Survey (SDSS) science database¹, the SDSS Catalog Archive Server (CAS) employs a multi-server, multi-queue batch job submission and tracking system [7]. From a system perspective, this leads to the simultaneous execution of a large number of queries that are essentially exploratory in nature [2]. Such an environment highlights the need for a scalable diversification system that is able to effectively facilitate data exploration tasks.

* Work done while the author was visiting the School of ITEE at the University of Queensland.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
SSDBM ’13, July 29 - 31 2013, Baltimore, MD, USA
Copyright 2013 ACM 978-1-4503-1921-8/13/07 $15.00

In this paper, we propose the DoS (Diversification of Multiple Search Results) scheme, which addresses the problem of efficiently diversifying the results of multiple queries.
Towards this goal, DoS leverages the natural overlap in search results in conjunction with the concurrent diversification of those overlapping results. This enables DoS to provide the same quality of diversification as that of the sequential methods, while significantly reducing the processing costs. Our experimental evaluation on both real and synthetic data sets shows the scalability exhibited by DoS under various workload settings, and the significant benefits it provides compared to sequential methods.

Roadmap: We define search result diversification in Section 2. We introduce multiple search result diversification and our DoS scheme in Section 3. The evaluation results are reported in Section 4, and we conclude in Section 5.

¹ http://www.sdss.org

2. BACKGROUND AND RELATED WORK
In this work, we consider queries that retrieve a number of results, or items, from the database. Database results can generally be viewed as a set of attribute values represented as data points in a multi-dimensional data space. In such systems, a query is first processed against the stored database, and the generated results are then further processed in-memory through a diversification system.

Let X = {x1, ..., xm} be a set of results for some user query. In general, the goal of result diversification is to select a subset S* of X with |S*| = k, k ≤ m, such that the diversity of the results in S* is maximized.

In this work, we primarily focus on the widely used content-based definition of diversity. Content diversity is an instance