ATUN-HL: Auto Tuning of Hybrid Layouts using Workload and Data Characteristics

Rana Faisal Munir 1,2, Alberto Abelló 1, Oscar Romero 1, Maik Thiele 2, and Wolfgang Lehner 2

1 Universitat Politècnica de Catalunya (UPC), Barcelona, Spain
{fmunir,aabello,oromero}@essi.upc.edu
2 Technische Universität Dresden (TUD), Dresden, Germany
{maik.thiele,wolfgang.lehner}@tu-dresden.de

Abstract. Ad-hoc analysis implies processing data in near real-time. Thus, raw data (i.e., neither normalized nor transformed) are typically dumped into a distributed engine, where they are generally stored in a hybrid layout. Hybrid layouts divide data into horizontal partitions and, inside each partition, store the data vertically. They keep statistics for each horizontal partition and also support encoding (e.g., dictionary encoding) and compression to reduce the size of the data. Their built-in support for many ad-hoc operations (e.g., selection, projection, and aggregation) makes hybrid layouts the best choice for most such operations. The horizontal partition and dictionary sizes of hybrid layouts are configurable and directly impact the performance of analytical queries. Hence, their default configuration cannot be expected to be optimal for all scenarios. In this paper, we present ATUN-HL (Auto TUNing Hybrid Layouts), which, based on a cost model and given the workload and the characteristics of the data, finds the best values for these parameters. We prototyped ATUN-HL for Apache Parquet, an open-source implementation of hybrid layouts for the Hadoop Distributed File System, to show its effectiveness. Our experimental evaluation shows that ATUN-HL provides on average 85% of all the potential performance improvement, and a 1.2x average speedup over the default configuration.

Keywords: Big data, Hybrid storage layouts, Auto tuning, Parquet

1 Introduction

Data analysis plays a decisive role in today's data-driven organizations, which increasingly produce and store large volumes of data, in the order of petabytes to zettabytes [16]. The storage and processing of such data have imposed a shift in hardware, from single machines to large-scale distributed systems. Apache Hadoop 3 is a pioneering large-scale distributed system and consists of a storage layer, namely the Hadoop Distributed File System (HDFS) 4, and a processing layer, namely MapReduce [6]. The former allows keeping data in raw format without any

3 https://hadoop.apache.org
4 https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
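To make the two tunable parameters from the abstract concrete, the sketch below shows how they surface in Apache Parquet's Java implementation (parquet-mr), where the horizontal partition corresponds to the row group and the dictionary size to the dictionary page. The property names (parquet.block.size, parquet.dictionary.page.size, parquet.enable.dictionary) are Parquet's own configuration keys; the values are merely the library's illustrative defaults, not the tuned values ATUN-HL would compute.

import org.apache.hadoop.conf.Configuration;

public class ParquetTuningSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Horizontal partition (row group) size in bytes;
        // 128 MiB is parquet-mr's default, used here only for illustration.
        conf.setLong("parquet.block.size", 128L * 1024 * 1024);

        // Dictionary page size in bytes; 1 MiB is the default.
        conf.setInt("parquet.dictionary.page.size", 1024 * 1024);

        // Dictionary encoding is enabled by default; set explicitly for clarity.
        conf.setBoolean("parquet.enable.dictionary", true);

        // This configuration would then be handed to a Parquet writer,
        // e.g., via ParquetOutputFormat in a MapReduce job.
    }
}

Choosing these two values jointly, per workload and dataset, is exactly the search space that ATUN-HL's cost model explores.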