Query-Driven Visualization in the Cloud with MapReduce

Bill Howe, University of Washington, billhowe@cs.washington.edu
Huy Vo, University of Utah, hvo@cs.utah.edu
Claudio Silva, University of Utah, csilva@cs.utah.edu

1. INTRODUCTION

We explore the MapReduce programming model for massive-scale query-driven visual analytics. Massively parallel programming frameworks such as MapReduce are increasingly popular for simplifying data processing on hundreds and thousands of cores, offering fault tolerance, linear scale-up, and a high-level programming interface. However, these tools are batch-oriented and are awkward to use directly for visualization. Informed by the success and popularity of MapReduce in the database research community, we evaluate the tradeoffs of using MapReduce to support massive-scale query-driven visualization, where “query” implies not just simple subsetting, but database-style algebraic manipulation.

Cloud computing promises an economy of scale for hardware, power, facilities, management, and, increasingly, software by moving computation and data to large, shared data centers. Two categories of cloud computing, Software-as-a-Service (SaaS) and Infrastructure-as-a-Service (IaaS), exemplified by companies such as Force.com and Amazon, respectively, are joined by a third category, Platform-as-a-Service (PaaS), which provides data management, analytics, processing, and visualization services. Most current PaaS offerings are focused on application hosting (cf. [4]), where a programming environment backed by some form of scalable storage is provided. Increasingly, though, massively parallel data analytics and visualization are needed to accommodate the data avalanche occurring in both commerce and science.

One popular tool for parallel data analytics is MapReduce [3], implemented in the open source project Hadoop [5].
The promise of Hadoop/MapReduce is that it significantly simplifies data-intensive scalable computing on thousands of cores, at least for those tasks that can be expressed in a particular way. The framework provides fault tolerance, scheduling, rack-awareness, and limited optimization in addition to parallel execution. One limitation, however, is latency: Hadoop is primarily a batch processing system. Therefore, Hadoop is not appropriate as an interactive visualization engine. Instead, it can be used as just one part of a large-scale parallel "query-driven visualization" system (cf. [1]). For example, we use Hadoop to prepare the working set that the client interacts with, but jobs are not executed in direct response to user actions. Hadoop jobs may also be fired speculatively to prepare data "near" the user’s current working set. Finally, Hadoop can be used to index or preprocess data to make it more amenable to visualization.

Figure 1: A section of the mouth of the Columbia River Estuary colored by salinity with a single streamline illustrating flow. This visualization is query-driven in the sense that a data manipulation step is performed prior to application of visualization algorithms. By adopting an algebraic approach, we can optimize and parallelize the data manipulation step independently of the visualization step.

To use MapReduce, the programmer implements two functions:

    map(in_key, in_val) -> list(out_key, intermediate_val)
    reduce(out_key, list(intermediate_val)) -> list(out_val)

The intermediate values produced in the map phase are then sorted by out_key, in parallel, and provided to the reduce function as a group. The reduce phase then generates a single output value. The power of this paradigm is that the programmer is only responsible for defining the program semantics over an individual data item; the framework provides the parallelism. This abstraction is not expressive enough for many visualization algorithms.
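
As a minimal sketch of the map/reduce contract described above, the following single-process Python program mimics the three steps: map each record, sort and group the intermediate pairs by out_key, then reduce each group. It is not Hadoop code and does not use any Hadoop API. The record layout (a cell identifier paired with a salinity reading), the sample values, and the names map_fn, reduce_fn, and run_mapreduce are assumptions made for illustration only.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(in_key, in_val):
    """Emit (out_key, intermediate_val) pairs for one input record."""
    cell_id, salinity = in_val  # hypothetical record layout
    yield cell_id, salinity


def reduce_fn(out_key, intermediate_vals):
    """Reduce all values sharing out_key; here, to their mean."""
    vals = list(intermediate_vals)
    yield sum(vals) / len(vals)


def run_mapreduce(records, map_fn, reduce_fn):
    """Sequential stand-in for the framework: map, sort by key, reduce."""
    # Map phase: every input record may emit any number of pairs.
    intermediate = [pair for k, v in records for pair in map_fn(k, v)]
    # Shuffle phase: sort by out_key so equal keys become adjacent.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: each key's values are handed to reduce_fn as a group.
    results = {}
    for out_key, group in groupby(intermediate, key=itemgetter(0)):
        results[out_key] = list(reduce_fn(out_key, (v for _, v in group)))
    return results


# Made-up sample input: (in_key, (cell_id, salinity)).
records = [
    (0, ("cell_a", 10.0)),
    (1, ("cell_b", 30.0)),
    (2, ("cell_a", 20.0)),
]
result = run_mapreduce(records, map_fn, reduce_fn)
print(result)
```

A per-key aggregate such as this one fits the model because each group is reduced independently of every other group; computations that need state shared across groups do not fit as easily.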
In particular, algorithms requiring recursion or significant intercommunication between processes are difficult to cast as a MapReduce program. However, preliminary data processing is a natural fit. In this work, we explore the limits of this abstraction for query-driven visualization.

2. ALGEBRAIC QUERY-DRIVEN VIZ

The data management community has begun to recognize the need for visual analytics [6], and the visualization community has begun to couple visualization techniques with remote query facilities. However, the “query” capabilities in “query-driven visualization” systems are generally limited to simple subsetting: the user specifies a region of interest as a “working set”, and the system retrieves it and feeds it into a visualization pipeline. We take a database-centric view of query capabilities, and argue that the more computation you can express in the data management layer, the better. Therefore, we are exploring the use of Hadoop for highly scalable visualization pre-processing in fewer lines of code.

Database researchers hold these truths to be self-evident: It is better to move the computation to the data than the data