IPSO: A Scaling Model for Data-Intensive Applications

Zhongwei Li, Feng Duan, Minh Nguyen, Hao Che, Yu Lei and Hong Jiang
Department of Computer Science & Engineering
University of Texas at Arlington, Arlington, U.S.
Email: zhongwei.li@mavs.uta.edu, feng.duan@mavs.uta.edu, mqnguyen@mavs.uta.edu, hche@uta.edu, ylei@uta.edu, hong.jiang@uta.edu

Abstract—Today's data center applications are predominantly data-intensive, calling for scaling out the workload to a large number of servers for parallel processing. Unfortunately, the existing scaling laws, notably Amdahl's and Gustafson's laws, are inadequate to characterize the scaling properties of data-intensive workloads. To fill this void, in this paper we put forward a new scaling model, called the In-Proportion and Scale-Out-induced scaling model (IPSO). IPSO generalizes the existing scaling models in two important aspects. First, it accounts for possible in-proportion scaling, i.e., the scaling of the serial portion of the workload in proportion to the scaling of the parallelizable portion of the workload. Second, it takes into account possible scale-out-induced scaling, i.e., the scaling of the collective overhead or workload induced by scaling out. IPSO exposes the scaling properties of data-intensive workloads, rendering the existing scaling laws its special cases. In particular, IPSO reveals two new pathological scaling properties: the speedup may level off even in the case of the fixed-time workload underlying Gustafson's law, and it may peak and then fall as the system scales out. Extensive MapReduce- and Spark-based case studies demonstrate that IPSO successfully captures diverse scaling properties of data-intensive applications. As a result, it can serve as a diagnostic tool to gain insights on, or even uncover counter-intuitive root causes of, observed scaling behaviors, especially pathological ones, for data-intensive applications.
Finally, preliminary results also demonstrate the promising prospects of IPSO in facilitating effective resource provisioning to achieve the best speedup-versus-cost tradeoffs for data-intensive applications.

Index Terms—scale-out workload, cloud computing, speedup, performance evaluation, Amdahl's Law, Gustafson's Law

I. INTRODUCTION

Predominant applications in today's datacenters are data-intensive and scale-out by design, based on, e.g., the MapReduce [1], Spark [2], and Dryad [3] programming frameworks. For such applications, job execution may involve one or multiple rounds of parallel task processing, with massive numbers of tasks and the associated data shards being scaled out to up to tens of thousands of low-cost commodity servers, followed by a serial (intermediate) result merging process.

Clearly, from both the user's and the datacenter provider's perspectives, it is imperative to gain a good understanding of the scaling properties of such applications so that informed datacenter resource provisioning decisions can be made to achieve the best speedup-versus-cost tradeoffs. Unfortunately, however, the existing scaling laws that have worked well for parallel, high-performance computing, such as Amdahl's law [4], Gustafson's law [5], and Sun-Ni's law [6], are no longer adequate to characterize the scaling properties of data-intensive workloads, for two reasons.

First and foremost, the traditional scaling models underlying these laws are exclusively focused on the scaling of the parallelizable portion of the workload, or external scaling (e.g., the fixed-size, fixed-time, and memory-bounded external scaling models underlying Amdahl's, Gustafson's, and Sun-Ni's laws, respectively), leaving the scaling of the serial portion of the workload, or internal scaling, a constant. Fig. 1 illustrates this, i.e., scaling out to three parallel processing units for Amdahl's model in Fig. 1(b) and Gustafson's or Sun-Ni's model in Fig. 1(c) from the sequential execution case in Fig.
1(a). While the parallelizable portion of the workload stays unchanged (i.e., fixed-size) and grows by three times (i.e., fixed-time or memory-bounded), respectively, the serial portion of the workload remains unchanged. The rationale behind this assumption is the understanding that the serial portion of a program mostly occurs in the initialization phase of the program, which is independent of the program size [5].

This assumption, however, no longer holds true for data-intensive workloads. This is because as the parallelizable portion of a data-intensive workload increases, so does the serial portion of the workload in general. In other words, the (intermediate) results to be merged in each round of the job execution are likely to grow in proportion to the external scaling, referred to as in-proportion scaling in this paper.

Second, the existing scaling models do not take possible scale-out-induced scaling into account, i.e., the scaling of the collective overhead or workload induced by the external scaling. As is widely recognized (see Section 2 for details), for data-intensive applications such workloads cannot be neglected in general, and they may be induced for various reasons, e.g., task dispatching, data broadcasting, reduction operations, or any type of resource contention among parallel tasks. Both in-proportion scaling and scale-out-induced scaling are responsible for the scalability challenges facing today's programming frameworks, such as Hadoop and Spark [7].

To overcome the above inadequacies of the existing scaling models, in this paper we put forward a new scaling model, referred to as the In-Proportion and Scale-Out-induced scaling model (IPSO). IPSO augments the traditional scaling models