A Cost-based Storage Format Selector for Materialized Results in Big Data Frameworks

Rana Faisal Munir 1,2 · Alberto Abelló 1 · Oscar Romero 1 · Maik Thiele 2 · Wolfgang Lehner 2

1 Universitat Politècnica de Catalunya (UPC), Barcelona, Spain
2 Technische Universität Dresden (TUD), Dresden, Germany

Rana Faisal Munir E-mail: fmunir@essi.upc.edu
Alberto Abelló E-mail: aabello@essi.upc.edu
Oscar Romero E-mail: oromero@essi.upc.edu
Maik Thiele E-mail: maik.thiele@tu-dresden.de
Wolfgang Lehner E-mail: wolfgang.lehner@tu-dresden.de

This is a post-peer-review, pre-copyedit version of an article published in Distributed and Parallel Databases. The final authenticated version is available online at: http://dx.doi.org/10.1007/s10619-019-07271-0

Abstract Modern big data frameworks (such as Hadoop and Spark) allow multiple users to do large-scale analysis simultaneously by deploying Data-Intensive Workflows (DIWs). The DIWs of different users share many common tasks (i.e., 50-80%), which can be materialized and reused in future executions. Materializing the output of such common tasks improves the overall processing time of DIWs and also saves computational resources. Current solutions for materialization store data on Distributed File Systems by using a fixed storage format. However, a fixed choice is not the optimal one for every situation. Specifically, different layouts (i.e., horizontal, vertical or hybrid) have a huge impact on execution, according to the access patterns of the subsequent operations.

In this paper, we present a cost-based approach that helps decide the most appropriate storage format in every situation. We first present a generic cost-based framework that selects the best format by considering the three main layouts. Then, we use our framework to instantiate cost models for specific Hadoop storage formats (namely SequenceFile, Avro and Parquet), and test it with two standard benchmark suites. Our solution gives on average 1.33x