Learned Cardinality Estimation: An In-depth Study

Kyoungmin Kim, Jisung Jeong, In Seo, Wook-Shin Han∗
Pohang University of Science and Technology (POSTECH), Korea
{kmkim, jsjeong, iseo, wshan}@dblab.postech.ac.kr

Kangwoo Choi, Jaehyok Chong
SAP Labs, Korea
{kangwoo.choi, ja.chong}@sap.com

ABSTRACT
Learned cardinality estimation (CE) has recently gained significant attention for replacing long-studied traditional CE with machine learning, especially deep learning. However, these estimators were developed independently and have not been fairly or comprehensively compared in common settings. Most studies use a subset of IMDB data that is too simple to reveal the estimators' limits or to determine whether they are ready for real, complex data. Furthermore, the estimators are regarded as black boxes, without a deep understanding of why large errors occur. In this paper, we first provide a taxonomy and a unified workflow of learned estimators to make them easier to understand. We next comprehensively compare recent learned CE methods that support joins, in settings ranging from a subset of tables to the full IMDB and TPC-DS datasets. Based on the experimental results, we then demystify the black-box models and analyze critical components that cause large errors. We also measure their impact on query optimization. Finally, based on the findings, we suggest realizable research opportunities. We believe that a deeper understanding of the behavior of existing methods can provide a more comprehensive and substantial framework for developing better estimators.

CCS CONCEPTS
• Information systems → Query optimization; • Computing methodologies → Machine learning.

KEYWORDS
Cardinality estimation

ACM Reference Format:
Kyoungmin Kim, Jisung Jeong, In Seo, Wook-Shin Han, Kangwoo Choi, and Jaehyok Chong. 2022. Learned Cardinality Estimation: An In-depth Study. In Proceedings of the 2022 International Conference on Management of Data (SIGMOD ’22), June 12–17, 2022, Philadelphia, PA, USA. ACM, New York, NY, USA, 14 pages.
https://doi.org/10.1145/3514221.3526154

∗ Corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
SIGMOD ’22, June 12–17, 2022, Philadelphia, PA, USA
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9249-5/22/06...$15.00

1 INTRODUCTION
Recent advances in deep learning have influenced database research as well, including query optimization in DBMSs [11, 31]. In particular, cardinality estimation (CE) of intermediate results has gained significant attention [1, 12, 17, 19, 24, 28] since it lies at the core of query optimization for producing good execution plans [9]. We focus on learned CE based on machine learning (ML) and deep learning (DL) techniques, which aims to replace traditional CE methods that rely on ad-hoc assumptions about the data, e.g., uniformity and independence [10]. By removing such assumptions, learned estimators improve estimation accuracy by a large margin. However, as raised in [25], the question remains whether learned estimators are ready for production, especially for real, complex data with joins. Existing learned CE benchmarks [18, 25] and methods [3, 27, 28, 32] neither compare against existing methods comprehensively nor reflect real, complex data.
They either 1) use single-table datasets only [1, 3, 4, 19, 25], 2) use multi-table datasets that are too simple (a star-shaped schema with only six tables) [5, 8, 26, 28], or 3) compare with only a few DL architectures or traditional methods [3, 18]. Furthermore, all these studies treat the learned estimators as black-box models [25] without a deep understanding of why large errors occur. Without such understanding, using them in commercial DBMSs with real data would be dangerous.

In this paper, we conduct in-depth experiments and analysis of the learned estimators (published at major conferences and journals before July 2021), moving from the simple settings of previous studies to more complex settings, including high-dimensional data in IMDB and TPC-DS benchmarks over 20 tables and 400 columns. We address the limitations of the learned estimators, demystify when and where they fail, and suggest research opportunities to overcome their problems. Beyond these suggestions, we provide meaningful improvements to the existing methods that even outperform state-of-the-art methods without relying on ensemble learning. In summary, we
• provide a taxonomy and a unified workflow of learned estimators for a high-level understanding (Section 2);
• propose new variations of the learned estimators that often outperform state-of-the-art methods on a database commonly used in previous studies (Section 2);
• comprehensively compare learned estimators using various datasets and workloads, including our synthetic environments (Section 3);
• demystify the black-box models and analyze the critical components that affect the performance (Section 4);
• improve the query optimization quality by injecting the learned cardinalities (Section 5); and
• summarize lessons learned and propose practical research opportunities that can facilitate future studies (Section 6).

Session 17: Query Processing and Optimization 2, SIGMOD ’22, June 12–17, 2022, Philadelphia, PA, USA
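As a minimal sketch of the uniformity and independence assumptions that the introduction attributes to traditional CE, the following Python snippet compares assumption-based estimates with exact counts. It is an illustration only, not code from the paper: the two toy tables, the function names, and the choice of equality predicates are all hypothetical.

```python
from collections import Counter


def true_cardinality(rows, a_val, b_val):
    """Exact number of rows satisfying a = a_val AND b = b_val."""
    return sum(1 for a, b in rows if a == a_val and b == b_val)


def uniformity_estimate(rows, column):
    """Estimate |column = v| for any v as N / (number of distinct values)."""
    distinct = {row[column] for row in rows}
    return len(rows) / len(distinct)


def independence_estimate(rows, a_val, b_val):
    """Estimate |a = a_val AND b = b_val| as N * sel(a) * sel(b).

    With sel(x) = count(x) / N this simplifies to count(a) * count(b) / N.
    """
    n = len(rows)
    count_a = Counter(a for a, _ in rows)[a_val]
    count_b = Counter(b for _, b in rows)[b_val]
    return count_a * count_b / n


# Hypothetical toy table with two perfectly correlated columns (b always equals a).
correlated = [(i % 10, i % 10) for i in range(1000)]

# Hypothetical toy table whose first column is heavily skewed toward the value 0.
skewed = [(0, 0)] * 910 + [(v, 0) for v in range(1, 10) for _ in range(10)]

if __name__ == "__main__":
    print("independence estimate:", independence_estimate(correlated, 3, 3),
          "true:", true_cardinality(correlated, 3, 3))
    print("uniformity estimate:", uniformity_estimate(skewed, 0),
          "true:", sum(1 for a, _ in skewed if a == 0))
```

On these toy tables, the independence estimate is 10 rows against a true count of 100, and the uniformity estimate is 100 rows against a true count of 910. Errors of this kind are what learned estimators try to avoid by modeling the data distribution directly.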