When Merging and Branch Predictors Collide

Oded Green
Georgia Institute of Technology
Atlanta, Georgia

Abstract—Merging is a building block for many computational domains. In this work we consider the relationship between merging, branch predictors, and input data dependency. Branch predictors are ubiquitous in modern processors, as they are useful for many high performance computing applications. While it is well known that performance and branch prediction accuracy go hand-in-hand, they have not been studied in the context of merging. We thoroughly test merging using multiple input array sizes and key ranges, using the same code and compiler optimizations. As the number of possible keys increases, so does the number of branch mis-predictions, resulting in reduced performance. The reduction in performance can be as much as 5X. We explain this phenomenon using a visualization technique called Merge Path that shows it intuitively. We support this visualization approach with modeling, thorough testing, and analysis on multiple systems.

Index Terms—Performance evaluation; Performance analysis; Merging; Sorting;

I. INTRODUCTION

Merging sorted arrays is a building block for many algorithms and applications, including sorting, database query joins, and graph contractions. Merging is similar to set intersection, which has many applications in graph theory, including the computation of clustering coefficients [1]. In this work, we show that the merge algorithm and branch predictors "do not get along" in many cases and that this conflict significantly reduces performance. When the branch predictor is highly accurate, performance is good due to the out-of-order execution supported by many modern architectures. However, when the branch prediction accuracy is low, performance drops significantly. We show that merging, which seems like a simple and straightforward algorithm, is in fact highly data-dependent.
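The data dependence comes from the comparison inside the textbook merge loop: every iteration branches on which array holds the smaller key, and the outcome is determined entirely by the input values. A minimal sketch of this standard merge (our own illustration for integer keys, not the paper's exact Alg. 1):

```c
#include <stddef.h>

/* Textbook merge of sorted arrays A and B into C.
 * The branch on A[a] <= B[b] is the data-dependent branch
 * whose prediction accuracy this paper studies: its outcome
 * depends only on the interleaving of the input keys. */
void merge(const int *A, size_t lenA,
           const int *B, size_t lenB,
           int *C)
{
    size_t a = 0, b = 0, c = 0;
    while (a < lenA && b < lenB) {
        if (A[a] <= B[b])            /* outcome decided by the input data */
            C[c++] = A[a++];
        else
            C[c++] = B[b++];
    }
    while (a < lenA) C[c++] = A[a++];  /* copy remaining tail of A */
    while (b < lenB) C[c++] = B[b++];  /* copy remaining tail of B */
}
```

When the two inputs interleave randomly (many distinct keys), this branch alternates unpredictably; when one input's keys dominate a long run (few distinct keys), the branch becomes highly predictable.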
This data dependency causes the branch predictor (on multiple systems) to predict the outcome of the branch with low certainty, and in practice it makes off-line analysis significantly more challenging. One outcome of our work suggests that load-balancing a merge is more challenging than was previously thought: even if each core receives an equal amount of work (i.e., the same number of elements to merge), one core can become the execution bottleneck, as it may take up to 4X-5X more time to merge the same number of elements due to branch mis-predictions. The actual performance of a parallel merge will depend on the distribution of the keys. The reader is referred to [2], [3], [4], [5], [6] for additional reading on parallel merge algorithms. Note that none of these take into consideration the number of distinct keys in the input.

The key contribution of this work is to explain why merging two different inputs can take different amounts of time despite requiring an equal number of comparisons. While this might seem straightforward, the results are surprising and offer new insights for creating better merging and sorting algorithms. We show that as the level of "randomness" increases, so does the number of branch misses, which in turn reduces overall performance. While the focus of this work is to explain why merging and branch predictors do not get along, we also introduce a simple workaround for the branch predictor: a branch-avoiding merging algorithm. This branch-avoiding algorithm is immune to the randomness of the data, as it does not depend on the outcome of the branch predictor. The approach that we present in this paper uses additional arithmetic operations. We extend this discussion in future sections.

In this work, we study the impact of branch prediction with the help of Merge Path [4], [5], a visual approach to merging. Using Merge Path, we can simplify the performance analysis of the merge.
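The branch-avoiding idea mentioned above replaces the data-dependent branch with arithmetic on the comparison result. The following is a sketch of that technique under our own assumptions (integer keys, and a compiler that lowers the ternary select to a conditional move); the paper's exact implementation may differ:

```c
#include <stddef.h>

/* Branch-avoiding merge sketch: the comparison result (0 or 1)
 * is used as an arithmetic value to advance the indices, so the
 * choice of source array involves no hard-to-predict branch.
 * Mainstream compilers typically emit a conditional move for the
 * ternary below; only the well-predicted loop-exit branches remain. */
void merge_branch_avoiding(const int *A, size_t lenA,
                           const int *B, size_t lenB,
                           int *C)
{
    size_t a = 0, b = 0, c = 0;
    while (a < lenA && b < lenB) {
        int take_b = B[b] < A[a];        /* 0 or 1, used arithmetically */
        C[c++] = take_b ? B[b] : A[a];   /* select, not a taken/not-taken branch */
        b += (size_t)take_b;             /* advance exactly one index */
        a += (size_t)(1 - take_b);
    }
    while (a < lenA) C[c++] = A[a++];    /* predictable tail copies */
    while (b < lenB) C[c++] = B[b++];
}
```

The trade-off is the one the paper names: a few extra arithmetic operations per element in exchange for a running time that is independent of the key distribution.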
The remainder of this section introduces Merge Path and explains the significance of the input's key distribution. Section 2 discusses related work. Section 3 presents our experimental setup and empirical results. In Section 4 we present some final thoughts.

A. Problem Statement

Given two sorted arrays A and B of lengths |A| and |B|, respectively, the output of a merge is the sorted array C of length |C| = |A| + |B| made up of all the elements of A and B. Simplified pseudo-code of the merge algorithm can be found in Alg. 1, and an extended discussion can be found in [7].

B. Merge Path

Merge Path [5], [4] restates the merging operation as a traversal of a 2D grid of size |A| × |B|, known as the Merge Path matrix. The traversal starts at the top-left corner and finishes at the bottom-right corner, Fig. 1(a). Array A is placed as a column along the left side of the matrix and array B is placed along the top of the matrix. The only legal moves are to the right (when B[b_i] < A[a_i]) or downwards (when B[b_i] >= A[a_i]), where a_i and b_i are the indices of the elements currently being merged in the two arrays. The decision per move is based on the values of A and B at the given indices. Essentially, the order in which elements are merged is equivalent to the traversal of the path, and the path is discovered sequentially in the process of the merge. We will further discuss the use of the path in the context of the number of possible keys in the following sections. Note that Merge