SubTab: Data Exploration with Informative Sub-Tables Kathy Razmadze Tel Aviv University kathyr@mail.tau.ac.il Yael Amsterdamer Bar-Ilan University amstery@cs.biu.ac.il Amit Somech Bar-Ilan University somecha@cs.biu.ac.il Susan B. Davidson University of Pennsylvania susan@cis.upenn.edu Tova Milo Tel Aviv University milo@post.tau.ac.il ABSTRACT We demonstrate SubTab, a framework for creating small, informative sub-tables of large data tables to speed up data exploration. Given a table with  rows and  columns where  and  are large, SubTab creates a sub-table   with  <<  rows and  <<  columns, i.e. a subset of  rows of the table projected over a subset of  columns. The rows and columns are chosen as representatives of prominent data patterns within and across columns in the input table. SubTab can also be used for query results, enabling the user to quickly understand the results and determine subsequent queries. CCS CONCEPTS · Information systems → Data mining; Information integration; Database views; Data cleaning. KEYWORDS Data analysis; Data exploration; ACM Reference Format: Kathy Razmadze, Yael Amsterdamer, Amit Somech, Susan B. Davidson, and Tova Milo. 2022. SubTab: Data Exploration with Informative Sub-Tables. In Proceedings of the 2022 International Conference on Management of Data (SIGMOD ’22), June 12ś17, 2022, Philadelphia, PA, USA. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3514221.3520154 1 INTRODUCTION Data exploration is an important frst step in data analytics. During this step, the analyst tries to understand an unfamiliar dataset and determine what part of the data is relevant to their task by displaying the table, looking at the table description, or visualizing column values. However, when displaying a large table only a small subset of the table is typically shown ś and without input from the user, the subset is arbitrary. For example, the default display of Pandas 1 tables using the Python display() command includes the frst and last several rows and columns. Frequently, this is not very informative as the sub-table may contain a lot of missing values 1 Pandas: Python Data Analysis Library. https://pandas.pydata.org/ Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for proft or commercial advantage and that copies bear this notice and the full citation on the frst page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specifc permission and/or a fee. Request permissions from permissions@acm.org. SIGMOD ’22, June 12ś17, 2022, Philadelphia, PA, USA © 2022 Association for Computing Machinery. ACM ISBN 978-1-4503-9249-5/22/06. . . $15.00 https://doi.org/10.1145/3514221.3520154 SCHEDULED ARRIVAL AIR TIME WHEELS OFF WHEELS ON DISTANCE DEPARTURE TIME SCHEDULED DEPARTURE CANCELLED 2157 242.0 1505.0 2207.0 1979 1448.0 1450 0 1925 NaN NaN NaN 733 NaN 1605 1 1059 31.0 1009.0 1040.0 109 955 1000 0 846 140.0 606.0 826.0 954 9551.0 600 0 1419 77.0 1231.0 1348.0 602 1222.0 1237 0 1920 135.0 1646.0 1901.0 1013 1620.0 1625 0 1643 156.0 1240.0 1616.0 1276 1228.0 1235 0 2050 79.0 1932.0 2051.0 550 1916.0 1916 0 YEAR MONTH DAY DAY OF WEEK … SECURITY DELAY AIRLINE DELAY LATE AIRCRAFT DELAY WEATHER DELAY 2015 1 1 4 … NaN NaN NaN NaN 2015 1 1 4 … NaN NaN NaN NaN 2015 1 1 4 … NaN NaN NaN NaN … … … … … … … … … 2015 12 31 4 … NaN NaN NaN NaN 2015 12 31 4 … NaN NaN NaN NaN 2015 12 31 4 NaN NaN NaN NaN SubTab Pre-Processing Step: • binning • embedding Selecting Step: • clustering • centroid selection Query Result Query Raw Dataset (6M Rows X 31 Columns): Informative Sub-Table (8 Rows X 8 Columns): Figure 1: System Architecture and/or fail to capture the range of possible values in a column; it may also elide columns that are important for further exploration. Example 1.1. Consider a table  taken from the Kaggle fights dataset 2 which contains 31 columns and ∼6M rows. The analyst is using  to predict fight cancellations, and hence is interested in a specifc target column, CANCELLED. The analyst starts by visually inspecting the data using Pandas display(  ), which yields the table displayed at the top of Figure 1. This display of  is not informative for the analysis task, as it does not include the target column. More crucially, its usefulness for data exploration is limited: e.g., the last fve columns contain only NaN values, and other columns include many repetitions of arbitrary values. □ The goal of SubTab is to support the data exploration task by selecting small, informative sub-tables through which analysts can view data. Given a table  with  rows and  columns, SubTab creates a sub-table with  <<  rows and  <<  columns, which is a subset of  rows projected over a subset of  columns of  . Intuitively, a sub-table is informative if it captures data patterns within and across columns in  . We formalize this intuition based on a combination of two complementary metrics: coverage and di- versity. Coverage, measures how well the sub-table covers patterns in the data, where we defne patterns through the standard notion of association rules. For the second metric, diversity, we rely on the average pair-wise similarity between the rows of the sub-table. 2 https://www.kaggle.com/usdot/fight-delays?select=fights.csv Demonstration SIGMOD ’22, June 12–17, 2022, Philadelphia, PA, USA 2369