Wide Table Layout Optimization based on Column Ordering and Duplication

Haoqiong Bian 1, Ying Yan 2, Wenbo Tao 3, Liang Jeff Chen 2, Yueguo Chen 1, Xiaoyong Du 1, Thomas Moscibroda 2
1 DEKE Key Lab (MOE), Renmin University of China; 2 Microsoft Research; 3 MIT
1 {bianhq, chenyueguo, duyong}@ruc.edu.cn, 2 {ying.yan, jeche, moscitho}@microsoft.com, 3 wenbo@mit.edu

ABSTRACT

Modern data analytical tasks often involve very wide tables, from a few hundred columns to a few thousand. While it is commonly agreed that column stores are an appropriate data format for wide tables and analytical workloads, the physical order of columns has not been investigated. Column ordering plays a critical role in I/O performance, because in wide tables accessing the columns in a single horizontal partition may involve multiple disk seeks. An optimal column ordering incurs minimal cumulative disk seek cost for the set of queries applied to the data. In this paper, we aim to find such an optimal column layout to maximize I/O performance. Specifically, we study two problems for column stores on HDFS: column ordering and column duplication. Column ordering seeks an approximately optimal order of columns; column duplication complements column ordering in that some columns may be duplicated multiple times to reduce contention among the queries' diverse requirements on the column order. We consider an actual fine-grained cost model for column accesses and propose algorithms that take a query workload as input and output a column ordering strategy, with or without storage redundancy, that significantly improves overall I/O performance. Experimental results over real-life data and production query workloads confirm the effectiveness of the proposed algorithms in diverse settings.

1. INTRODUCTION

The challenge of Big Data has shifted the design of data analytical systems from single machines to large-scale distributed systems.
While there are many distributed data analytical systems on the market and their runtime specialties vary greatly, they all share a common core as their underlying storage engine: HDFS (the Hadoop Distributed File System). In HDFS, a common data model is to represent data as two-dimensional tables. A table is horizontally partitioned to scale out and to leverage multi-machine parallelism. At runtime, each table partition, a.k.a. a row group, is read and processed individually by mappers on each machine.

The physical layout of a row group plays a fundamental and critical role in system I/O performance [27, 34, 29, 30].

----
Yueguo Chen and Ying Yan are the corresponding authors.

SIGMOD'17, May 14-19, 2017, Chicago, Illinois, USA.
© 2017 ACM. ISBN 978-1-4503-4197-4/17/05...$15.00
DOI: http://dx.doi.org/10.1145/3035918.3035930
----

Existing studies have focused on two aspects of organizing data in a row group: (i) column store vs. row store and (ii) row group size. A row-oriented store serializes a row as a blob and allows an application to retrieve one row at a time. While this data format is common in conventional databases, modern data analytical systems on HDFS have primarily opted for column stores, e.g., RCFile [29], ORC File [5] and Parquet [3]. A column store serializes a whole column in a row group as a blob.
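To make the layout difference concrete, the following minimal Python sketch contrasts the two serializations of a row group. It is purely an illustration we add here, not any system's actual on-disk format; the column names and values are invented.

```python
# Toy sketch: how a row group's cells are arranged in a row store
# vs. a column store. All names and values are made up.

row_group = [
    {"user": "u1", "ts": 100, "clicks": 3},
    {"user": "u2", "ts": 101, "clicks": 7},
]
columns = ["user", "ts", "clicks"]

# Row store: each row is serialized contiguously, one row after another.
row_layout = [tuple(r[c] for c in columns) for r in row_group]

# Column store: each column of the row group is serialized as one
# contiguous blob.
col_layout = {c: [r[c] for r in row_group] for c in columns}

# Reading only "clicks" touches a single contiguous blob in the column
# layout, whereas the row layout forces a scan through every row.
print(col_layout["clicks"])  # → [3, 7]
```

The same contrast explains why a query touching few columns of a very wide table reads far less data from a column store.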
This benefits I/O performance for two reasons: first, many queries access only a small number of columns, even though the underlying table is very wide; second, storing a column in its entirety facilitates compression. It has been shown that when the row group size is sufficiently large, the additional cost incurred by row reconstruction can be amortized and eventually becomes almost negligible [30]. In practice, the row group size is set to a value that enables massively parallel data processing while still keeping the re-do cost reasonable in case of a mapper/reader failure.

In this paper, we study a new dimension of organizing data in a row group: column ordering. Column ordering in a column store specifies how columns are physically ordered, so that two adjacent columns can be accessed with sequential reads. Column ordering is not an important issue when a table has only dozens of columns. However, at Microsoft Bing, data analysis pipelines commonly feature very wide tables with thousands of columns. These tables are either raw logs or products of cooking logs to facilitate other analysis jobs down the pipeline. Very wide tables are common in modern data analytical systems because of their advantages in analytical processing compared to normalizing/partitioning data into narrower tables (e.g., [36]). In one specific instance, an in-production table contains 1187 columns and grows at a pace of 5 TB per day. For such a wide table, finding the right column order can lead to dramatic differences in performance.

To understand the benefits of column ordering, we select three frequently executed queries from Microsoft Bing's data analytical pipeline, namely A, B and C, following the query template

SELECT col_i1, col_i2, ...
FROM TargetTable
[WHERE cond]
[GROUP BY col_j1, col_j2, ...]
[ORDER BY col_k1, col_k2, ...]

and measure their performance on a column store with different column orders.
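The intuition behind "cumulative seek cost" can be sketched with a toy model. The Python snippet below is a simplification we introduce for illustration, not the paper's fine-grained cost model: it charges one disk seek per maximal run of physically adjacent columns that a query accesses, so scattering a query's columns across the order inflates its cost.

```python
# Toy seek-cost model (our illustration, not the paper's actual model):
# one seek starts the first run of accessed columns; every gap between
# accessed columns in the physical order starts another seek.

def seek_count(order, accessed):
    """Number of seeks a query needs under a given physical column order."""
    positions = sorted(order.index(c) for c in accessed)
    return 1 + sum(1 for a, b in zip(positions, positions[1:]) if b - a > 1)

def workload_cost(order, queries):
    """Cumulative seek cost of a workload (each query = set of columns)."""
    return sum(seek_count(order, q) for q in queries)

queries = [{"a", "b", "c"}, {"x", "y"}]

# Interleaved order scatters both queries' columns.
order_bad = ["a", "x", "b", "y", "c", "z"]
# Clustering each query's columns makes their reads sequential.
order_good = ["a", "b", "c", "x", "y", "z"]

print(workload_cost(order_bad, queries))   # → 5
print(workload_cost(order_good, queries))  # → 2
```

Under this toy model, reordering alone cuts the workload's seek count from 5 to 2; the paper's algorithms search for such orders (and selective column duplicates) under a realistic cost model.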
We execute the queries on the default 1187-column table (unordered) in the actual production environment, as well as on its duplicate whose columns are manually ordered so that columns accessed by the queries are physically close. The results show that column ordering reduces end-to-end query execution time by up to 70%. These results are understandable: in this workload, disk I/O is the main cost of query execution, and disk seeks are a dominant factor in disk I/O. If columns accessed by a query are physically placed