SubTab: Data Exploration with Informative Sub-Tables
Kathy Razmadze
Tel Aviv University
kathyr@mail.tau.ac.il
Yael Amsterdamer
Bar-Ilan University
amstery@cs.biu.ac.il
Amit Somech
Bar-Ilan University
somecha@cs.biu.ac.il
Susan B. Davidson
University of Pennsylvania
susan@cis.upenn.edu
Tova Milo
Tel Aviv University
milo@post.tau.ac.il
ABSTRACT
We demonstrate SubTab, a framework for creating small, informative
sub-tables of large data tables to speed up data exploration. Given
a table with rows and columns where and are large, SubTab
creates a sub-table
with << rows and << columns, i.e.
a subset of rows of the table projected over a subset of columns.
The rows and columns are chosen as representatives of prominent
data patterns within and across columns in the input table. SubTab
can also be used for query results, enabling the user to quickly
understand the results and determine subsequent queries.
CCS CONCEPTS
· Information systems → Data mining; Information integration;
Database views; Data cleaning.
KEYWORDS
Data analysis; Data exploration;
ACM Reference Format:
Kathy Razmadze, Yael Amsterdamer, Amit Somech, Susan B. Davidson,
and Tova Milo. 2022. SubTab: Data Exploration with Informative Sub-Tables.
In Proceedings of the 2022 International Conference on Management of Data
(SIGMOD ’22), June 12ś17, 2022, Philadelphia, PA, USA. ACM, New York, NY,
USA, 4 pages. https://doi.org/10.1145/3514221.3520154
1 INTRODUCTION
Data exploration is an important frst step in data analytics. During
this step, the analyst tries to understand an unfamiliar dataset
and determine what part of the data is relevant to their task by
displaying the table, looking at the table description, or visualizing
column values. However, when displaying a large table only a small
subset of the table is typically shown ś and without input from
the user, the subset is arbitrary. For example, the default display
of Pandas
1
tables using the Python display() command includes the
frst and last several rows and columns. Frequently, this is not very
informative as the sub-table may contain a lot of missing values
1
Pandas: Python Data Analysis Library. https://pandas.pydata.org/
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for proft or commercial advantage and that copies bear this notice and the full citation
on the frst page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specifc permission and/or a
fee. Request permissions from permissions@acm.org.
SIGMOD ’22, June 12ś17, 2022, Philadelphia, PA, USA
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9249-5/22/06. . . $15.00
https://doi.org/10.1145/3514221.3520154
SCHEDULED
ARRIVAL
AIR
TIME
WHEELS
OFF
WHEELS
ON
DISTANCE DEPARTURE
TIME
SCHEDULED
DEPARTURE
CANCELLED
2157 242.0 1505.0 2207.0 1979 1448.0 1450 0
1925 NaN NaN NaN 733 NaN 1605 1
1059 31.0 1009.0 1040.0 109 955 1000 0
846 140.0 606.0 826.0 954 9551.0 600 0
1419 77.0 1231.0 1348.0 602 1222.0 1237 0
1920 135.0 1646.0 1901.0 1013 1620.0 1625 0
1643 156.0 1240.0 1616.0 1276 1228.0 1235 0
2050 79.0 1932.0 2051.0 550 1916.0 1916 0
YEAR MONTH DAY DAY OF
WEEK
… SECURITY
DELAY
AIRLINE
DELAY
LATE AIRCRAFT
DELAY
WEATHER
DELAY
2015 1 1 4 … NaN NaN NaN NaN
2015 1 1 4 … NaN NaN NaN NaN
2015 1 1 4 … NaN NaN NaN NaN
… … … … … … … … …
2015 12 31 4 … NaN NaN NaN NaN
2015 12 31 4 … NaN NaN NaN NaN
2015 12 31 4 NaN NaN NaN NaN
SubTab
Pre-Processing
Step:
• binning
• embedding
Selecting Step:
• clustering
• centroid
selection
Query
Result
Query
Raw Dataset (6M Rows X 31 Columns):
Informative Sub-Table (8 Rows X 8 Columns):
Figure 1: System Architecture
and/or fail to capture the range of possible values in a column; it
may also elide columns that are important for further exploration.
Example 1.1. Consider a table taken from the Kaggle fights
dataset
2
which contains 31 columns and ∼6M rows. The analyst is
using to predict fight cancellations, and hence is interested in a
specifc target column, CANCELLED. The analyst starts by visually
inspecting the data using Pandas display( ), which yields the table
displayed at the top of Figure 1. This display of is not informative
for the analysis task, as it does not include the target column. More
crucially, its usefulness for data exploration is limited: e.g., the last
fve columns contain only NaN values, and other columns include
many repetitions of arbitrary values. □
The goal of SubTab is to support the data exploration task by
selecting small, informative sub-tables through which analysts can
view data. Given a table with rows and columns, SubTab creates
a sub-table with << rows and << columns, which is a
subset of rows projected over a subset of columns of .
Intuitively, a sub-table is informative if it captures data patterns
within and across columns in . We formalize this intuition based
on a combination of two complementary metrics: coverage and di-
versity. Coverage, measures how well the sub-table covers patterns
in the data, where we defne patterns through the standard notion
of association rules. For the second metric, diversity, we rely on the
average pair-wise similarity between the rows of the sub-table.
2
https://www.kaggle.com/usdot/fight-delays?select=fights.csv
Demonstration SIGMOD ’22, June 12–17, 2022, Philadelphia, PA, USA
2369