Parallelizing General Histogram Application for CUDA Architectures

Ugljesa Milic*†, Isaac Gelado*, Nikola Puzovic*, Alex Ramirez*† and Milo Tomasevic‡
* Barcelona Supercomputing Center Centro Nacional de Supercomputacion, Barcelona, Spain. Email: {first.last}@bsc.es
† Universitat Politecnica de Catalunya, Barcelona, Spain
‡ School of Electrical Engineering, University of Belgrade, Belgrade, Serbia. Email: mvt@etf.rs

Abstract—Histogramming is a tool commonly used in data analysis. Although its serial version is simple to implement, providing an efficient and scalable way to parallelize it can be challenging. This holds especially for platforms that contain one or more massively parallel devices, such as CUDA-capable GPUs, because of issues with domain decomposition, use of global memory, and similar concerns. In this paper we compare two approaches for implementing general purpose histogramming on GPUs. The first algorithm is based on private copies of bin counters stored in shared memory for each block of threads. The second one uses the Thrust library to sort the input elements and then to search for upper bounds according to bin widths. For both algorithms we analyze how the speedup over the sequential version depends on the size of the input collection, the number of bins, and the type and distribution of the input elements. We also implement overlapping of data transfers between the host CPU and the CUDA device with kernel execution. For both algorithms we analyze the pros and cons in detail. For example, the privatization strategy can be up to 2x faster than sort-search with realistic inputs, but can only support a limited number of bins. On the other hand, the sort-search strategy has about 50% higher speedup than privatization when we use characters as input, and it can support an unlimited number of bins. Finally, we perform an exploration to determine the optimal algorithm depending on the characteristics and values of the input parameters.

I. INTRODUCTION

Histogramming is one of the basic statistical tools used in data analysis [1]. Strictly speaking, a histogram is a function that counts the number of observations (elements of the input) that fall into each of a set of disjoint categories (bins). A histogram is commonly represented as an array where each element corresponds to one of the bins and contains the number of input elements that fall into it. Bins are defined by their starting element and by widths that are chosen so that different bins do not overlap. When an input element falls within the range of a bin, the value of that bin is incremented by one. In this way, a histogram approximates the probability density function for a given input set. Histograms are widely used in data mining, image analysis, pattern recognition, data presentation, and data analysis in general.

The calculation of a histogram is straightforward when it is done sequentially. There is also a wide range of parallel implementations of histograms for multi-core, SMP, and NUMA machines. These implementations have served as the basis for the implementation of histogramming algorithms on GPUs using CUDA (Compute Unified Device Architecture) [2]. However, GPUs pose several challenges to achieving an efficient and scalable execution. For instance, atomic instructions can greatly harm performance on GPUs because they can potentially serialize the execution of all threads in a warp that would otherwise execute in parallel. However, if the histogram is being computed for very sparse data that results in small bin counts, such conflicts between concurrent threads seldom happen and, therefore, the usage of atomic instructions might be quite efficient.

There are two main strategies to produce a parallel implementation of dense histograms with an arbitrary number of bins, bin width, and data type.
The first approach is based on decomposing the input set into many domains, each of which is computed individually by one thread block (i.e., privatization). Each of these individual histograms is used to update the output histogram when all threads in the block have finished. This approach requires extensive use of atomic operations in both shared and global memory, to produce the private histogram and to update the final output respectively. Hence, this implementation might be quite inefficient when computing histograms with few bins and/or very sparse input data. We also explore a different implementation where the input data is first sorted. After that, the positions of the sorted elements are found according to the upper bounds of the bins. This implementation completely avoids the usage of atomic instructions, but requires a sorting stage that might introduce large overheads.

In this paper, we present an experimental analysis of the different trade-offs of each of these implementations. The main contributions of this paper are:

• Analysis of the state-of-the-art algorithms for implementing histogramming on GPU architectures.
• Implementation of general purpose histogramming algorithms that can operate with any data type or bin widths, and with any given size of the input and output arrays.
• Analysis of trade-offs for tuning the performance of algorithms for GPUs, and design space exploration for determining the best algorithm for a given set of input parameters.
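
The two strategies described in this section can be illustrated with a minimal CPU-side C++ sketch. It is not the authors' CUDA code. The names `Bins`, `bin_of`, `histogram_privatized`, and `histogram_sort_search` are hypothetical, and so is the bin layout, which describes each bin by an inclusive upper bound. The privatized version emulates thread blocks serially with fixed-size chunks, and the comments mark where a GPU version would need atomic updates. The sort-search version uses `std::sort` and `std::upper_bound`; on a GPU, Thrust's `thrust::sort` and `thrust::upper_bound` could play the corresponding role.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Counts = std::vector<std::size_t>;

// Hypothetical bin layout (an assumption, not taken from the paper):
// bin 0 covers [lo, ub[0]] and bin i covers (ub[i-1], ub[i]].
// Values outside [lo, ub.back()] are ignored.
template <typename T>
struct Bins {
    T lo;               // smallest value that is counted
    std::vector<T> ub;  // inclusive upper bound of each bin, sorted ascending
};

// Returns the bin index of x, or bins.ub.size() if x lies outside all bins.
template <typename T>
std::size_t bin_of(const Bins<T>& bins, const T& x) {
    if (x < bins.lo) return bins.ub.size();
    // The first upper bound that is >= x identifies the bin.
    return static_cast<std::size_t>(
        std::lower_bound(bins.ub.begin(), bins.ub.end(), x) - bins.ub.begin());
}

// Privatization, emulated serially: each chunk stands in for one thread block.
template <typename T>
Counts histogram_privatized(const std::vector<T>& in, const Bins<T>& bins,
                            std::size_t chunk) {
    assert(chunk > 0 && !bins.ub.empty());
    Counts out(bins.ub.size(), 0);
    for (std::size_t begin = 0; begin < in.size(); begin += chunk) {
        const std::size_t end = std::min(in.size(), begin + chunk);
        Counts priv(bins.ub.size(), 0);  // would live in shared memory on a GPU
        for (std::size_t i = begin; i < end; ++i) {
            const std::size_t b = bin_of(bins, in[i]);
            if (b < priv.size()) ++priv[b];  // shared-memory atomic on a GPU
        }
        // Merge the private copy into the output (global-memory atomic on a GPU).
        for (std::size_t b = 0; b < out.size(); ++b) out[b] += priv[b];
    }
    return out;
}

// Sort-search: sort a copy of the input, then locate each bin's upper bound.
template <typename T>
Counts histogram_sort_search(std::vector<T> in, const Bins<T>& bins) {
    assert(!bins.ub.empty() && !(bins.ub.front() < bins.lo));
    std::sort(in.begin(), in.end());
    // Number of sorted elements that lie below the first bin.
    std::size_t prev = static_cast<std::size_t>(
        std::lower_bound(in.begin(), in.end(), bins.lo) - in.begin());
    Counts out(bins.ub.size(), 0);
    for (std::size_t b = 0; b < out.size(); ++b) {
        // Cumulative number of elements <= the upper bound of bin b.
        const std::size_t cum = static_cast<std::size_t>(
            std::upper_bound(in.begin(), in.end(), bins.ub[b]) - in.begin());
        out[b] = cum - prev;  // adjacent difference gives the bin count
        prev = cum;
    }
    return out;
}
```

Both functions should return the same counts for the same input and bins. The sketch makes the trade-off visible: the privatized version keeps one counter array per chunk, so its memory footprint grows with the number of bins, while the sort-search version needs no shared counters but pays for sorting the whole input.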