External Merge Sort for Top-K Queries
Eager input filtering guided by histograms

Yannis Chronis, University of Wisconsin-Madison, chronis@cs.wisc.edu
Thanh Do, Google Inc, tddo@google.com
Goetz Graefe, Google Inc, goetzg@google.com
Keith Peters, Google Inc, petersk@google.com

ABSTRACT
Business intelligence and web log analysis workloads often use queries with top-k clauses to produce the most relevant results. Values of k range from small to rather large, and sometimes the requested output exceeds the capacity of the available main memory. When the requested output fits in the available memory, existing top-k algorithms are efficient, as they can eliminate almost all but the top k results before sorting them. When the requested output exceeds the main memory capacity, existing algorithms externally sort the entire input, which can be very expensive. Furthermore, the drastic difference in execution cost when the memory capacity is exceeded results in an unpleasant user experience. Every day, tens of thousands of production top-k queries executed on F1 Query resort to an external sort of the input.

To address these challenges, we introduce a new top-k algorithm that is able to eliminate parts of the input before sorting or writing them to secondary storage, regardless of whether the requested output fits in the available memory. To achieve this, at execution time our algorithm creates a concise model of the input using histograms. The proposed algorithm is implemented as part of F1 Query and is used in production, where it significantly accelerates top-k queries with outputs larger than the available memory. We evaluate our algorithm against existing top-k algorithms and show that it reduces I/O traffic and can be up to 11× faster.

Work done while at Google Inc.

SIGMOD'20, June 14–19, 2020, Portland, OR, USA
© 2020 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-6735-6/20/06.
https://doi.org/10.1145/3318464.3389729

CCS CONCEPTS
• Information systems → Query operators.
KEYWORDS
Top-K; Query Operators; Out-of-core

ACM Reference Format:
Yannis Chronis, Thanh Do, Goetz Graefe, and Keith Peters. 2020. External Merge Sort for Top-K Queries: Eager input filtering guided by histograms. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD'20), June 14–19, 2020, Portland, OR, USA. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3318464.3389729

1 INTRODUCTION
When analyzing today's huge data volumes, e.g., web logs, users typically want the most relevant results. Business intelligence and web analytics use top-k queries for the final or intermediate results. Users may want only a handful of result rows, but sometimes they want a large amount of data selected from a huge amount of data. Such queries are common in practice in big internet services companies, and the requested output, k, can exceed the capacity of the main memory.

For example, a data scientist at Facebook might request the 50 million most commented and liked photos out of the 300 million photos posted each day [17]; an engineer at Twitter might want to perform trend analysis on the 10% most important tweets out of the 3.5 billion tweets of the past week [32]; an engineer at Google might calculate the intersection between the 40 million most active search users and the 40 million most active Gmail users, where the user base of each service exceeds a billion users [34]; an operations analyst at Amazon might request half of the 100 million US Prime members that are most likely to buy a certain product [18].

All methods for optimizing top-k algorithms attempt to eliminate input rows not needed in the output, ideally before they are sorted. The standard way to evaluate top-k queries uses an in-memory priority queue [4]. The top of the priority queue is the last row to be included in the final output.
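The in-memory priority-queue approach described above can be sketched as follows. This is a minimal illustration only, not the F1 Query implementation: the function name `top_k_largest`, the use of plain integers as rows, and the choice of descending order are assumptions made for the example. The queue is bounded at k entries, so its top (`heap[0]`) is always the last row of the current top-k, and any row that cannot beat it is dropped without being sorted.

```python
import heapq
from typing import Iterable, List


def top_k_largest(rows: Iterable[int], k: int) -> List[int]:
    """Return the k largest rows in descending order.

    Hypothetical helper for illustration; rows are plain integers here.
    """
    if k <= 0:
        return []
    heap: List[int] = []  # min-heap holding the best k rows seen so far
    for row in rows:
        if len(heap) < k:
            # The queue is not full yet, so every row is a candidate.
            heapq.heappush(heap, row)
        elif row > heap[0]:
            # heap[0] is the cutoff: the last row of the current top-k.
            # The new row beats it, so it replaces the cutoff row.
            heapq.heapreplace(heap, row)
        # Otherwise the row cannot appear in the output and is discarded.
    # Only the k surviving rows are sorted.
    return sorted(heap, reverse=True)


if __name__ == "__main__":
    print(top_k_largest([5, 1, 9, 3, 7, 8], 3))
```

Note that this sketch needs memory for k rows, so it only applies when the requested output fits in main memory.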
Research 27: Distributed and Parallel Processing

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs International 4.0 License.

As