Nondeterminism in MapReduce Considered Harmful? An Empirical Study on Non-commutative Aggregators in MapReduce Programs

Tian Xiao 1,2, Jiaxing Zhang 2, Hucheng Zhou 2, Zhenyu Guo 2, Sean McDirmid 2, Wei Lin 3, Wenguang Chen 1, Lidong Zhou 2
1 Tsinghua University, China
2 Microsoft Research, China
3 Microsoft Bing, USA

ABSTRACT

The simplicity of MapReduce introduces unique subtleties that cause hard-to-detect bugs; in particular, the unfixed order of reduce function input is a source of nondeterminism that is harmful if the reduce function is not commutative and sensitive to input order. Our extensive study of production MapReduce programs reveals interesting findings on commutativity, nondeterminism, and correctness. Although non-commutative reduce functions lead to five bugs in our sample of well-tested production programs, we surprisingly have found that many non-commutative reduce functions are mostly harmless due to, for example, implicit data properties. These findings are instrumental in advancing our understanding of MapReduce program correctness.

Categories and Subject Descriptors
D.2.5 [Software Engineering]: Testing and Debugging; D.1.3 [Programming Techniques]: Concurrent Programming

General Terms
Languages, Reliability

Keywords
MapReduce, nondeterminism, commutativity, bug

1. INTRODUCTION

MapReduce [5] has emerged as the main programming model for data-parallel computation given its simple programming model of mappers and reducers that enable parallel failure-resilient execution on many machines. There is however a significant gulf between a static MapReduce program and its execution when we reason about correctness. For example, it is well-known that nondeterministic user-defined mappers and reducers will produce different results when re-executed in response to failures [5]. Another subtlety important to the correctness of MapReduce programs is nondeterminism in data shuffling that occurs between map and reduce stages.
ICSE ’14, May 31 – June 7, 2014, Hyderabad, India. Copyright 2014 ACM 978-1-4503-2768-8/14/05.

When a MapReduce program executes on a cluster of machines, mappers execute concurrently on a set of machines over their data partitions. Keyed data entries produced by mappers are exchanged via data shuffling to machines where reducers run, so that data items with the same key are aggregated into a sequence delivered to the same reducer. Due to uncertainties in the number of mappers/reducers, network latency, and scheduling decisions, the order of each sequence is nondeterministic. A reduce function that is not commutative might produce different results for different sequence orders, which can lead to correctness violations.

This problem has not gone unnoticed by the software engineering research community. For example, Csallner et al. [3] propose that non-commutative reducers are bugs that can be detected through symbolic execution. In practice, programmers who write reducers are usually not aware of commutativity, and it remains largely unknown whether non-commutative reducers are a serious issue for correctness. We have therefore conducted the first ever empirical study on the commutativity of real-world user-defined reducers in production MapReduce-style programs to answer the following questions:

• How pervasive are non-commutative reducers in real-world MapReduce programs?
• How does the output of a non-commutative reducer depend on input order? Are there any common patterns?
• Are non-commutative reducers always harmful?
  Is it appropriate to flag them as bugs?
• Are there real bugs caused by non-commutative reducers? If so, what do they look like and what is their impact?

Our study has collected 507 distinct custom user-defined reducers found in 13,311 real-world MapReduce-style jobs in our production cluster. We studied the reducer code manually to identify the non-commutative ones; the findings, summarized in Table 1, are quite surprising. Non-commutative reducers not only exist, but are pervasive: 58% of the reducers examined are non-commutative. More importantly, our investigation indicates that most of those non-commutative reducers do not lead to correctness issues. Flagging non-commutative reducers as bugs, as proposed in [3], is then likely to create many false positives that will frustrate programmers.

Our further investigation reveals that, surprisingly, most (88%) of the non-commutative reducers can be categorized into five simple patterns even though they encode a wide variety of algorithms in different coding styles. For some patterns, non-commutative reducers lead to nondeterministic results, but the nondeterminism appears to be known and tolerated by programmers. For other patterns, non-commutative reducers are guaranteed to produce deterministic results as long as the data they operate on has certain properties.
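To make the order dependence discussed in this section concrete, the following is a minimal Python sketch. It is not taken from the studied programs: the reducers `sum_reducer` and `first_reducer`, the helper `outputs_over_all_orders`, and the sample values are hypothetical, and the "all values are equal" condition is only an assumed example of a data property, not one of the patterns reported in the study. The sketch runs each reducer over every possible arrival order of the values for one key and collects the distinct outputs.

```python
import itertools


def sum_reducer(key, values):
    """Commutative: the result is the same for any ordering of values."""
    return key, sum(values)


def first_reducer(key, values):
    """Non-commutative: keeps whichever value happens to arrive first."""
    return key, next(iter(values))


def outputs_over_all_orders(reducer, key, values):
    """Apply the reducer to every permutation of values; return distinct outputs."""
    return {reducer(key, list(p)) for p in itertools.permutations(values)}


# Hypothetical values with differing content: shuffle order matters
# for first_reducer but not for sum_reducer.
mixed = [3, 1, 2]
sum_outputs = outputs_over_all_orders(sum_reducer, "k", mixed)
first_outputs = outputs_over_all_orders(first_reducer, "k", mixed)

# Assumed data property: all values for the key are equal, so even the
# non-commutative reducer yields a single deterministic output.
uniform = [7, 7, 7]
first_uniform_outputs = outputs_over_all_orders(first_reducer, "k", uniform)

print(len(sum_outputs), len(first_outputs), len(first_uniform_outputs))
```

In this sketch the commutative reducer yields one distinct output, the non-commutative reducer yields one output per possible first value, and the same non-commutative reducer becomes deterministic once the assumed data property holds.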