Modeling Stream Processing Applications for Dependability Evaluation

Gabriela Jacques-Silva†♠, Zbigniew Kalbarczyk†, Buğra Gedik♠, Henrique Andrade♠‡, Kun-Lung Wu♠, Ravishankar K. Iyer†

†Coordinated Science Laboratory, University of Illinois at Urbana-Champaign
{kalbarcz,rkiyer}@illinois.edu
♠Thomas J. Watson Research Center, IBM Research
{g.jacques,bgedik,klwu}@us.ibm.com

Abstract—This paper describes a modeling framework for evaluating the impact of faults on the output of streaming applications. Our model is based on three abstractions: stream operators, stream connections, and tuples. By composing these abstractions within a Stochastic Activity Network, we can model complete applications. We consider faults that lead to data loss and to silent data corruption (SDC). Our framework captures how faults originating in one operator propagate to other operators down the stream processing graph. We demonstrate the extensibility of our framework by evaluating three different fault tolerance techniques: checkpointing, partial graph replication, and full graph replication. We show that under crashes that lead to data loss, partial graph replication has a great advantage over checkpointing in maintaining the accuracy of the application output. We also show that SDC can break the no-data-duplication guarantees of a fault tolerance technique based on full graph replication.

I. INTRODUCTION

Stream processing applications continuously process multiple sources of live data (e.g., audio and business feeds), analyze them on-the-fly, and generate results. Examples of such applications include algorithmic trading, fraud detection, and health monitoring systems. Streaming applications are assembled as dataflow graphs, where each vertex of the graph is a stream operator and each edge is a stream connection. To achieve high performance, stream operators can run across different nodes of a distributed system.
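The dataflow-graph view described above can be illustrated with a minimal sketch. This is not code from the paper or from System S; the `Operator` class, `connect`, and `process` names are hypothetical, and the example only shows how tuples flow along stream connections from one operator to the next.

```python
# Illustrative sketch (hypothetical names): a streaming application as a
# dataflow graph. Vertices are stream operators; edges are stream
# connections along which tuples travel.

class Operator:
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn            # per-tuple processing function
        self.downstream = []    # outgoing stream connections

    def connect(self, other):
        self.downstream.append(other)

    def process(self, tup):
        out = self.fn(tup)
        if out is not None:     # an operator may filter a tuple out
            for op in self.downstream:
                op.process(out)

# A three-operator pipeline: source -> filter -> sink
results = []
src = Operator("source", lambda t: t)
flt = Operator("filter", lambda t: t if t["price"] > 10 else None)
snk = Operator("sink", lambda t: results.append(t))
src.connect(flt)
flt.connect(snk)

for tup in [{"price": 5}, {"price": 42}]:
    src.process(tup)
# results == [{"price": 42}] — only the tuple passing the filter reaches the sink
```

In a distributed deployment, each operator could run on a different node, with the connections realized as network channels; the graph structure itself is unchanged.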
In this environment, a fault in a stream operator can result in massive data loss or in the generation of inaccurate results. Fault tolerance techniques must be used to achieve application resiliency to errors. To understand the benefits of applying a certain fault tolerance technique to a streaming application, it is critical to evaluate its effect on the application output, especially considering the differing resource consumption and performance impact of alternative techniques [1], [2], [3], [4].

Previous research on the evaluation of fault tolerance techniques for streaming applications has mostly focused on their performance overhead [2], [4]. Our earlier work [5] proposes evaluating the impact of faults on the application output via fault injection. While fault injection can be applied directly to the real system and yields accurate results, it can be very time-consuming and expensive to deploy, especially if we consider that operators can fail concurrently.

[Footnote] This work was supported by an IBM PhD Fellowship (awarded to Gabriela Jacques-Silva) and the IBM-UIUC Open Collaborative Research Project. ‡Currently employed by Goldman Sachs; email: henrique.c.m.andrade@gmail.com

In this paper, we describe a modeling framework to evaluate the dependability provided by different fault tolerance techniques under varying fault models. The framework allows us to compare the relative merits of different techniques, so that the user can determine which technique performs best for the application at hand. The framework considers faults that lead to data loss and data corruption. To the best of our knowledge, we are the first to consider the problem of data corruption in streaming applications. In addition, our framework considers the consequences of error propagation, i.e., the impact that a fault at one stream operator can have on the downstream operators and on the application output.
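The error-propagation effect just described can be made concrete with a small, hypothetical sketch (not from the paper): a silent data corruption in an upstream operator, modeled here as a single bit flip, travels through a downstream aggregation and distorts the application output. The `scale` and `aggregate` operator names are invented for illustration.

```python
# Hypothetical sketch of error propagation: an SDC in an upstream operator
# flips a bit in one tuple's value, and the corrupted value propagates
# through a downstream aggregation into the output.

def scale(t, faulty=False):
    v = 2 * t
    return v ^ 0x10 if faulty else v  # SDC modeled as a bit flip

def aggregate(stream):
    return sum(stream)  # downstream operator consuming scaled tuples

inputs = [1, 2, 3, 4]
fault_free = aggregate(scale(t) for t in inputs)                   # 20
corrupted = aggregate(scale(t, faulty=(t == 3)) for t in inputs)   # 36
```

A single corrupted tuple is enough to change the aggregate from 20 to 36; quantifying such output deviations, rather than only the failure of an operator, is the kind of question the framework is designed to answer.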
This is an important problem that has also not been addressed by the research community.

The developed framework is based on generic models specified with the Stochastic Activity Network (SAN) formalism [6]. One of the main innovations of our approach is to provide SAN-based abstractions for the key components of a streaming application: stream operators, stream connections, and tuples. By assembling these components, we represent the complete dataflow graph of the target application as a SAN. Furthermore, we devise techniques to capture the error propagation behavior of various fault models in the SAN representation of a streaming application, making it possible to evaluate the dependability achieved by various fault tolerance techniques under different fault models.

The framework is used to evaluate the effectiveness of three different fault tolerance techniques, namely checkpointing [3], high-availability groups [5], and full replication [2]. Our experiments with faults that cause data loss show that high-availability groups have an advantage over checkpointing in maintaining the accuracy of the application output. Our results also indicate that faults that lead to data corruption can break the no-data-duplication guarantee provided by the modeled full replication technique. We evaluate the accuracy of our approach by comparing the results obtained by running a target application in the proposed framework and in System S, a stream processing middleware developed at IBM Research.

The main contributions of this paper are (i) a framework with generic models to compose streaming applications and