External Merge Sort for Top-K Queries
Eager input filtering guided by histograms
Yannis Chronis∗
University of Wisconsin-Madison
chronis@cs.wisc.edu
Thanh Do
Google Inc
tddo@google.com
Goetz Graefe
Google Inc
goetzg@google.com
Keith Peters
Google Inc
petersk@google.com
ABSTRACT
Business intelligence and web log analysis workloads often
use queries with top-k clauses to produce the most relevant
results. Values of k range from small to rather large and
sometimes the requested output exceeds the capacity of the
available main memory. When the requested output fits in
the available memory, existing top-k algorithms are efficient,
as they can eliminate almost all rows other than the top k before
sorting them. When the requested output exceeds the main
memory capacity, existing algorithms externally sort the en-
tire input, which can be very expensive. Furthermore, the
drastic difference in execution cost when the memory capacity
is exceeded results in an unpleasant user experience.
Every day, tens of thousands of production top-k queries
executed on F1 Query resort to an external sort of the input.
To address these challenges, we introduce a new top-k
algorithm that is able to eliminate parts of the input before
sorting or writing them to secondary storage, regardless of
whether the requested output fits in the available memory.
To achieve this, at execution time our algorithm creates a
concise model of the input using histograms. The proposed
algorithm is implemented as part of F1 Query and is used
in production, where it significantly accelerates top-k queries
with outputs larger than the available memory. We evaluate
our algorithm against existing top-k algorithms and show
that it reduces I/O traffic and can be up to 11× faster.
∗Work done while at Google Inc.
SIGMOD’20, June 14–19, 2020, Portland, OR, USA
© 2020 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-6735-6/20/06.
https://doi.org/10.1145/3318464.3389729
CCS CONCEPTS
· Information systems → Query operators.
KEYWORDS
Top-K; Query Operators; Out-of-core
ACM Reference Format:
Yannis Chronis, Thanh Do, Goetz Graefe, and Keith Peters. 2020.
External Merge Sort for Top-K Queries: Eager input filtering guided
by histograms. In Proceedings of the 2020 ACM SIGMOD International
Conference on Management of Data (SIGMOD’20), June 14–19, 2020,
Portland, OR, USA. ACM, New York, NY, USA, 15 pages.
https://doi.org/10.1145/3318464.3389729
1 INTRODUCTION
When analyzing today’s huge data volumes, e.g., web logs,
users typically want the most relevant results. Business intelligence
and web analytics use top-k queries for final or
intermediate results. Users may want only a handful of result
rows, but sometimes they request a large amount of data selected from a
huge amount of data. Such queries are common in practice at
large internet services companies, and the requested output, k,
can exceed the capacity of main memory. For example, a
data scientist at Facebook might request the 50 million most
commented and liked photos out of the 300 million photos
posted each day [17]; an engineer at Twitter might want to
perform trend analysis on the 10% most important tweets
out of the 3.5 billion tweets of the past week [32]; an engineer
at Google might calculate the intersection between the 40
million most active search users and the 40 million most
active Gmail users, where the user base of each service exceeds
a billion users [34]; an operations analyst at Amazon might
request half of the 100 million US Prime members that are
most likely to buy a certain product [18].
All methods for optimizing top-k algorithms attempt to
eliminate input rows not needed in the output; ideally, before
they are sorted. The standard way to evaluate top-k queries
uses an in-memory priority queue [4]. The top of the priority
queue is the last row to be included in the final output. As
Research 27: Distributed and Parallel Processing SIGMOD ’20, June 14–19, 2020, Portland, OR, USA
This work is licensed under a Creative Commons Attribution-
NonCommercial-NoDerivs International 4.0 License.