Squall: Scalable Real-time Analytics

Aleksandar Vitorovic, Mohammed Elseidy, Khayyam Guliyev, Khue Vu Minh, Daniel Espino, Mohammad Dashti, Yannis Klonatos and Christoph Koch
{firstname}.{lastname}@epfl.ch
École Polytechnique Fédérale de Lausanne

ABSTRACT

Squall is a scalable online query engine that runs complex analytics in a cluster using skew-resilient, adaptive operators. Squall builds on state-of-the-art partitioning schemes and local algorithms, including some of our own. This paper presents an overview of Squall, including some novel join operators. The paper also presents lessons learned over five years of working on this system, and outlines the plan for the proposed system demonstration.

1. INTRODUCTION

Online processing implies that results are built incrementally as the input arrives. Thus, each input tuple produces output and updates the system state necessary for processing subsequent inputs. Online processing is ubiquitous in applications such as algorithmic trading, clickstream analysis and business intelligence (e.g., in order to reach a potential customer during the active session).

Skew occurs frequently in real-life datasets. For instance, certain types of skewed distributions (such as the Zipfian distribution) appear in Internet packet traces, city sizes, word frequency in natural languages and advertisement clickstreams [17]. Existing open-source online systems (e.g., Twitter’s Storm [49], Spark Streaming [73], Flink [14]¹) focus on distribution primitives (e.g., communication patterns, fault tolerance) and low-level performance optimizations. However, these systems provide only vanilla database operators, such as hash-based equi-joins (and general UDFs), which do not perform well in the case of skew (see §3.1). Regarding non-equi joins, Storm does not provide them, whereas Spark Streaming and Flink execute them very inefficiently (as a Cartesian product followed by a selection).
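To make the skew problem concrete, the following minimal sketch (illustrative only, not part of Squall) compares the per-worker load of hash partitioning on uniform and Zipf-like key frequencies. The function names, the modulo operation standing in for a hash function, and the parameter values are all assumptions chosen for illustration.

```python
def zipf_counts(num_keys, total, s=1.0):
    """Deterministic Zipf-like frequencies (hypothetical helper).

    The key with 0-based index k has rank k + 1 and receives a share of
    `total` proportional to 1 / rank**s, truncated to an integer.
    """
    weights = [1.0 / (rank ** s) for rank in range(1, num_keys + 1)]
    norm = sum(weights)
    return {key: int(total * w / norm) for key, w in enumerate(weights)}


def hash_partition_load(key_counts, num_workers):
    """Tuples received by each worker when all tuples of a key go to one worker.

    `key % num_workers` is a stand-in for a real hash function.
    """
    load = [0] * num_workers
    for key, count in key_counts.items():
        load[key % num_workers] += count
    return load


def imbalance(load):
    """Ratio of the maximum load to the mean load; 1.0 means perfect balance."""
    return max(load) / (sum(load) / len(load))


if __name__ == "__main__":
    workers = 8
    uniform = {key: 100 for key in range(1000)}
    skewed = zipf_counts(num_keys=1000, total=100_000)
    print("uniform imbalance:", imbalance(hash_partition_load(uniform, workers)))
    print("skewed imbalance: ", imbalance(hash_partition_load(skewed, workers)))
```

In this setup the uniform input is perfectly balanced, whereas under the Zipf-like frequencies the worker that owns the most frequent key receives a disproportionate share of the tuples and becomes the bottleneck of a hash-partitioned equi-join.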
On the other hand, existing partitioning schemes that support both equi-joins and non-equi joins (e.g., [54]) have the following drawbacks. First, they work efficiently only for a narrow set of data distribution properties. Second, these schemes are designed for offline processing, and thus they are unable to adapt to changing data statistics (see §5). Squall addresses these problems.

Squall is a system that puts together state-of-the-art partitioning schemes, local query operators, and techniques for scalable online query processing. We also build novel 2-way [32, 66] and multi-way schemes (Hybrid-Hypercube, see §3.1). Such a system allows us to leverage the effect of various design choices on the performance, and to seamlessly build efficient novel operators (see §3). Squall operators achieve skew-resilience, adaptivity and scalability.

Squall is an open-source project² that has been developed for the last five years (mainly by the authors at EPFL, but also with external contributions). It has been available for several years, and it has attracted a community of users.

¹ Flink provides both offline and online processing, but in this paper we discuss only the online case.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Proceedings of the VLDB Endowment, Vol. 9, No. 10. Copyright 2016 VLDB Endowment 2150-8097/16/06.

2. SYSTEM ARCHITECTURE

Squall is an online distributed query engine which achieves low latency and high throughput. It supports full-history (incremental view maintenance) and window (stream) semantics. Squall uses Storm [49] as a distribution and parallelization platform. The overall system architecture is shown in Figure 1. Next, we give an overview of various Squall concepts.
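The full-history and window semantics mentioned above can be illustrated with a minimal sketch of a symmetric hash join, a standard local algorithm for online equi-joins. This is not Squall's implementation: the class name, the method signature, and the assumption that timestamps arrive in non-decreasing order are hypothetical and chosen for illustration only.

```python
from collections import defaultdict, deque


class OnlineEquiJoin:
    """Symmetric hash join over two input streams named "R" and "S".

    window=None keeps the full history of both inputs. Otherwise, tuples
    whose timestamp is at most `now - window` are evicted before probing.
    Assumes timestamps arrive in non-decreasing order (an assumption of
    this sketch, not a statement about Squall).
    """

    def __init__(self, window=None):
        self.window = window
        # One hash index per input: key -> deque of (timestamp, value).
        self.state = {"R": defaultdict(deque), "S": defaultdict(deque)}

    def insert(self, side, key, value, ts):
        """Process one tuple: emit its join results, then store it as state."""
        other = "S" if side == "R" else "R"
        if self.window is not None:
            self._evict(ts)
        # Probe the opposite index; results are always ordered (R value, S value).
        matches = [
            (value, v) if side == "R" else (v, value)
            for _, v in self.state[other][key]
        ]
        # Update the state so that later tuples can join with this one.
        self.state[side][key].append((ts, value))
        return matches

    def _evict(self, now):
        """Drop expired tuples; a full scan, kept simple rather than efficient."""
        for index in self.state.values():
            for bucket in index.values():
                while bucket and bucket[0][0] <= now - self.window:
                    bucket.popleft()


if __name__ == "__main__":
    full = OnlineEquiJoin()            # full-history semantics
    full.insert("R", 1, "r1", ts=0)
    print(full.insert("S", 1, "s1", ts=5))

    windowed = OnlineEquiJoin(window=10)  # window semantics
    windowed.insert("S", 1, "s1", ts=5)
    print(windowed.insert("R", 1, "r2", ts=100))
```

Each call to `insert` both produces output and updates the stored state, which is the online processing pattern described in the introduction. The only difference between the two semantics in this sketch is whether old state is ever discarded.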
User interface. Squall offers multiple interfaces: declarative (SQL), functional (a modern Scala collections API), interactive (Scala) and imperative (Java). Similarly to Hive, which provides an SQL interface on top of Hadoop, Squall’s declarative interface allows running SQL over Storm. Squall’s functional interface provides for compositions of data transformations over streams. Squall also provides an interactive interface built on top of the Scala REPL (Read-Eval-Print Loop) that allows a user to interactively construct and run query plans. For each of these three interfaces, Squall translates the user input to a logical query plan (see Figure 1). Finally, the imperative interface gives the user full control over the physical query plan. A user can run a query plan specified by any Squall interface either locally or on a cluster, making it easy to learn and test Squall.

² https://github.com/epfldata/squall/

Logical and Physical query plans. A logical Squall query plan is a DAG of relational algebra operators. A physical Squall query plan consists of a DAG of physical operators and their requested level of parallelism. A physical operator is specified by the partitioning scheme and local algorithm. To minimize the number of network hops, and thus to maximize the performance, we co-locate the connected operators that employ the same partitioning scheme. We denote a