Use of Dependency Information for Memory Optimizations in Distributed Streaming Applications

Nissim Harel, Hasnain A. Mandviwala, Umakishore Ramachandran
College of Computing, Georgia Institute of Technology, Atlanta, Georgia 30332
Email: {nissim, mandvi, rama}@cc.gatech.edu

Kath Knobe
Intel Corporation, Email: kath.knobe@intel.com

Abstract—In this paper we explore the potential of using application data dependency information to reduce the average memory consumption of distributed streaming applications. By analyzing data dependencies at runtime, we can infer which data items are not going to influence the application's output. This information is then incorporated into the garbage collector, extending the garbage identification problem to include not only data items that are unreachable, but also data items that are not fully processed and are dropped. We present three garbage collection algorithms, each using different data dependency information. We implement the algorithms and compare their performance on a color-tracker application. Our results show that these algorithms not only substantially reduce average memory usage but also improve the overall performance of the application. The results also indicate that the garbage identification algorithms that achieve the lowest memory footprint make their identification decisions locally, but base those decisions on best-effort global information obtained from other components of the distributed application.

I. INTRODUCTION

The physical and economic feasibility of capturing and processing a large number of data streams from different sources in real time makes it possible to develop and deploy a new class of applications, called streaming applications.
Broadly speaking, streaming applications are organized as a series (or a pipeline) of tasks processing streams of data: for example, starting with sequences of camera images, extracting higher-level "features" and "events" at each stage, and eventually responding with outputs. These applications tend to be distributed and involve processing large sets of different types of streaming inputs at near real-time. The requirement to handle large quantities of data poses a major challenge to the overall efficiency of streaming applications and makes effective memory and buffer management vital for their successful deployment.

The computing power currently available still allows us to process only a fraction of all the data captured by the application. Nevertheless, in many cases this is enough to get sufficiently close to the desired result (i.e., what we would have achieved had we processed all the data). The reason lies in the fact that streaming applications try to attach meaning to the information they acquire: the goal is not to fully process all the data captured, but rather to extract a specific meaning from it. Data items that are not fully processed are dropped at different stages of the computation and have no influence on the application's outcome. The resources (computation, bandwidth, and memory) allocated to process these items can therefore be considered wasted, and this waste should be minimized.

In this paper, we exploit the unique characteristics of streaming applications [1] and explore the potential of using inter-stream data dependency information to identify and reclaim wasted resources as early as possible.
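As a minimal sketch of this idea (the class and method names below are hypothetical illustrations, not part of any actual runtime API): a channel can track, per consumer, the lowest timestamp that consumer may still request. Any item older than every consumer's minimum can no longer influence the output, so it can be reclaimed even though it was never fully processed. The policy assumed here — that a consumer requesting timestamp t will never revisit anything older — is one possible dependency rule, chosen only for illustration.

```python
# Illustrative sketch of dependency-based garbage identification.
# All names are hypothetical; this is not the Stampede API.
class DependencyAwareChannel:
    def __init__(self, consumer_ids):
        self.items = {}                                  # timestamp -> data
        self.min_interest = {c: 0 for c in consumer_ids}

    def put(self, ts, data):
        self.items[ts] = data

    def get(self, consumer, ts):
        # Assumed dependency rule: requesting timestamp ts implicitly
        # declares that this consumer will never request anything older.
        self.min_interest[consumer] = max(self.min_interest[consumer], ts)
        self._collect()
        return self.items.get(ts)

    def _collect(self):
        # Items below every consumer's minimum of interest are garbage:
        # not merely unreachable, but guaranteed never to be requested.
        horizon = min(self.min_interest.values())
        for ts in [t for t in self.items if t < horizon]:
            del self.items[ts]
```

For example, if a tracker thread skips ahead to timestamp 3, items 0 through 2 are reclaimed at that point, even though they were never processed.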
In a broader context, we propose extending the definition of the garbage collection problem in streaming applications to include not only data items that are not "reachable" by the application's threads, but also data items that have no effect on the final outcome of the application. Each of the algorithms we propose applies a different method to identify items that can be considered garbage: it analyzes the allocated data items and uses data dependency information to infer which items will not be requested by any of the application threads. These items are marked as garbage and reclaimed.

The proposed memory optimization algorithms are evaluated using the Stampede runtime system [2], which serves as a test bed for this study. Stampede supports the development and execution of streaming applications by handling communication, synchronization, and buffer management, thereby directing the application writer's attention away from these arduous and repetitive tasks. The programming model of Stampede is simple and intuitive. A Stampede program consists of a dynamic collection of threads communicating timestamped data items through channels. Threads can be created to run anywhere in the cluster. Channels can be created anywhere in the cluster and have cluster-wide unique names. Threads connect to these channels to perform input/output via get/put operations; the facility through which a channel and a thread communicate is called a connection. A timestamp value is used as a name for a data item that a thread puts into or gets from a channel. The Stampede runtime takes care of the synchronization and communication inherent in these operations, as well as managing the storage for items put into or gotten from the channels.

The remainder of the paper is organized as follows. In

1095-2055/07/$25.00 ©2007 IEEE
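The Stampede programming model summarized above (threads communicating timestamped items through named channels via get/put) can be sketched as follows. This is a single-machine illustration under assumed semantics — blocking gets, cluster-wide names reduced to plain strings — and the names do not reflect the actual Stampede API.

```python
# Hedged sketch of the programming model, not the Stampede API:
# threads put timestamped items into named channels; a get for a
# timestamp blocks until that item has been put.
import threading

class Channel:
    """A named channel holding timestamped items."""
    def __init__(self, name):
        self.name = name
        self.items = {}                      # timestamp -> item
        self.cond = threading.Condition()

    def put(self, ts, item):
        with self.cond:
            self.items[ts] = item
            self.cond.notify_all()

    def get(self, ts):
        # Block until an item with the requested timestamp is available.
        with self.cond:
            while ts not in self.items:
                self.cond.wait()
            return self.items[ts]

def producer(ch, n):
    for ts in range(n):
        ch.put(ts, "frame-%d" % ts)

frames = Channel("frames")
threading.Thread(target=producer, args=(frames, 3)).start()
result = frames.get(2)      # blocks until the producer puts timestamp 2
```

A consumer thread simply names the timestamp it wants; the runtime (here, the condition variable) hides the synchronization between producer and consumer, which is the convenience the Stampede model provides.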