Learned Cardinality Estimation: An In-depth Study
Kyoungmin Kim†, Jisung Jeong†, In Seo†, Wook-Shin Han†∗, Kangwoo Choi‡, Jaehyok Chong‡
†Pohang University of Science and Technology (POSTECH), Korea
‡SAP Labs, Korea
†{kmkim, jsjeong, iseo, wshan}@dblab.postech.ac.kr
‡{kangwoo.choi, ja.chong}@sap.com
ABSTRACT
Learned cardinality estimation (CE) has recently gained significant
attention for replacing long-studied traditional CE with machine
learning, especially deep learning. However, these estimators
were developed independently and have not been fairly or comprehensively compared in common settings. Most studies use a subset
of the IMDB dataset, which is too simple to measure their limits or to
determine whether they are ready for real, complex data. Furthermore,
they are regarded as black boxes, without a deep understanding of
why large errors occur.
In this paper, we first provide a taxonomy and a unified workflow
of learned estimators for a better understanding of estimators. We
next comprehensively compare recent learned CE methods that
support joins, ranging from a subset of tables to the full IMDB and TPC-DS
datasets. Based on the experimental results, we then demystify the
black-box models and analyze critical components that cause large
errors. We also measure their impact on query optimization. Finally,
based on the findings, we suggest realizable research opportunities.
We believe that a deeper understanding of the behavior of exist-
ing methods can provide a more comprehensive and substantial
framework for developing better estimators.
CCS CONCEPTS
• Information systems → Query optimization; • Computing
methodologies → Machine learning.
KEYWORDS
Cardinality estimation
ACM Reference Format:
Kyoungmin Kim, Jisung Jeong, In Seo, Wook-Shin Han, Kangwoo Choi, and
Jaehyok Chong. 2022. Learned Cardinality Estimation: An In-depth Study.
In Proceedings of the 2022 International Conference on Management of Data
(SIGMOD ’22), June 12–17, 2022, Philadelphia, PA, USA. ACM, New York, NY,
USA, 14 pages. https://doi.org/10.1145/3514221.3526154
1 INTRODUCTION
Recent advances in deep learning have influenced database research
areas as well, including query optimization in DBMSs [11, 31].
In particular, the cardinality estimation (CE) of intermediate results
has gained significant attention [1, 12, 17, 19, 24, 28] since it lies at
∗ Corresponding author.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.
SIGMOD ’22, June 12–17, 2022, Philadelphia, PA, USA
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9249-5/22/06. . . $15.00
https://doi.org/10.1145/3514221.3526154
the core of query optimization for producing good execution plans
[9].
We focus on learned CE based on machine learning (ML)
and deep learning (DL) techniques, which aims to replace traditional CE
methods that rely on ad-hoc assumptions about the data, e.g., uniformity
and independence [10]. By removing such assumptions, learned
estimators improve estimation accuracy by a large margin. However,
as raised in [25], the question remains whether learned estimators
are ready for production, especially for real, complex data with
joins.
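To make the danger of such ad-hoc assumptions concrete, the following toy sketch (our illustration, not from the paper) shows how a traditional estimator that multiplies per-column selectivities under the independence assumption badly underestimates the cardinality of a conjunctive predicate over two perfectly correlated columns:

```python
import random

# Hypothetical table with two perfectly correlated columns (a == b).
random.seed(0)
rows = [(v, v) for v in (random.randint(0, 9) for _ in range(10_000))]

# True cardinality of the predicate: a = 3 AND b = 3.
true_card = sum(1 for a, b in rows if a == 3 and b == 3)

# Traditional estimate: multiply per-column selectivities
# (independence assumption).
sel_a = sum(1 for a, _ in rows if a == 3) / len(rows)
sel_b = sum(1 for _, b in rows if b == 3) / len(rows)
est_card = sel_a * sel_b * len(rows)

# Since a == b on every row, true_card ≈ 1,000 while est_card ≈ 100:
# the independence assumption underestimates by roughly 10x.
print(true_card, round(est_card))
```

Learned estimators avoid this error by modeling the joint distribution (or a direct query-to-cardinality mapping) instead of assuming the columns are independent.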
Existing learned CE benchmarks [18, 25] and methods [3, 27, 28,
32] neither compare against existing methods comprehensively nor
reflect real, complex data. They either 1) use single-table datasets
only [1, 3, 4, 19, 25], 2) use multi-table datasets that are too simple
(star-shaped schema with six tables only) [5, 8, 26, 28], or 3) compare
with a few DL architectures or traditional methods only [3, 18].
Furthermore, all these studies treat the learned estimators as black-
box models [25] without a deep understanding of why large errors
occur. Without such understanding, using them in commercial
DBMSs with real data would be dangerous.
In this paper, we conduct in-depth experiments and analysis of
the learned estimators (published at major conferences and journals
before July 2021), ranging from the simple settings of previous studies to
more complex settings, including high-dimensional data in the IMDB
and TPC-DS benchmarks with over 20 tables and 400 columns. We
address the limitations of the learned estimators, demystify when and
where they fail, and suggest research opportunities to overcome
their problems. Beyond these suggestions, we provide meaningful
improvements over the existing methods that even outperform
state-of-the-art methods without relying on ensemble learning.
In summary, we
• provide a taxonomy and a unified workflow of learned esti-
mators for a high-level understanding (Section 2);
• propose new variations of the learned estimators that of-
ten outperform state-of-the-art methods on a database com-
monly used in previous studies (Section 2);
• comprehensively compare learned estimators using various
datasets and workloads, including our synthetic environ-
ments (Section 3);
• demystify the black-box models and analyze the critical com-
ponents that affect the performance (Section 4);
• improve the query optimization quality by injecting the
learned cardinalities (Section 5); and
• summarize lessons learned and propose practical research
opportunities that can facilitate future studies (Section 6).