Sorting Large Multifield Records on a GPU*

Shibdas Bandyopadhyay and Sartaj Sahni
Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611
shibdas@ufl.edu, sahni@cise.ufl.edu

Abstract—We extend the fastest comparison-based (sample sort) and non-comparison-based (radix sort) number sorting algorithms on a GPU to sort large multifield records. Two extensions are discussed: direct (the entire record is moved whenever its key is to be moved) and indirect ((key, index) pairs are sorted using the direct extension and the records are then ordered according to the obtained index permutation). Our results show that, for the ByField layout, the direct extension of the radix sort algorithm GRS [1] is the fastest for 32-bit keys when records have at least 12 fields; otherwise, the direct extension of the radix sort algorithm SRTS [13] is the fastest. For the Hybrid layout, the indirect extension of SRTS is the fastest.

Index Terms—Graphics Processing Units, sorting multifield records, radix sort, sample sort.

I. INTRODUCTION

Graphics Processing Units (GPUs) are fast becoming an essential component of desktop computers. Low prices and massively parallel computation capability make them a viable choice for desktop supercomputing, in addition to accelerating games and other graphics-intensive tasks. From the viewpoint of general-purpose computation, GPUs are manycore processors capable of running thousands of threads with very little context-switching overhead. NVIDIA's Tesla GPUs come with 240 scalar processing cores (SPs) [14], organized into 30 streaming multiprocessors (SMs), each having 8 SPs. Each SM has a 16 KB fast shared memory that is shared among the threads running on that SM. There is also a large register file comprising 16384 32-bit registers that are used to store the local variables of threads and the states of numerous threads for context-switching purposes.
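The difference between the two extensions in the abstract can be made concrete with a small sequential sketch (plain Python used here purely for illustration; the paper's actual implementations are CUDA kernels). In the indirect extension, only lightweight (key, index) pairs move during the sort, and each full multifield record is then moved exactly once in a final gather step. The function names below are hypothetical, not from the paper.

```python
def direct_sort(records, key_of):
    # Direct extension: the entire record is moved whenever
    # its key is moved during the sort.
    return sorted(records, key=key_of)

def indirect_sort(records, key_of):
    # Indirect extension: sort (key, index) pairs by key only...
    pairs = sorted(((key_of(r), i) for i, r in enumerate(records)),
                   key=lambda p: p[0])
    # ...then order the records according to the obtained
    # index permutation, moving each full record exactly once.
    return [records[i] for _, i in pairs]
```

Both produce the same ordering; the trade-off is that the indirect extension touches far less data per compare-exchange when records carry many fields, at the cost of one extra gather pass.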
Being a graphics processor, each SM also includes texture caches for fast texture look-up. The GPU also has a small read-only constant memory. Each Tesla GPU comes with 4 GB of off-chip global (or device) memory. Figure 1 gives the Tesla architecture. GPUs can now be programmed using general-purpose languages such as C with Application Programming Interfaces (APIs) such as OpenCL, or an NVIDIA-specific C extension known as Compute Unified Device Architecture (CUDA) [23].

One of the very first GPU sorting algorithms, an adaptation of bitonic sort, was developed by Govindaraju et al. [6]. Since this algorithm was developed before the advent of CUDA, it was implemented using GPU pixel shaders. Zachmann et al. [7] improved on this sort algorithm by using Bitonic Trees to reduce the number of comparisons while merging the bitonic sequences. Cederman et al. [5] have adapted quick sort for GPUs. Their adaptation first partitions the sequence to be sorted into subsequences, sorts these subsequences in parallel, and then merges the sorted subsequences in parallel. A hybrid sort algorithm that splits the data using bucket sort and then merges the data using a vectorized version of merge sort is proposed by Sintorn et al. [18]. Satish et al. [16] have developed an even faster merge sort. The fastest GPU merge sort algorithm known at this time is Warpsort [21]. Warpsort first creates sorted sequences using bitonic sort, each sorted sequence being created by a thread warp. The sorted sequences are merged in pairs until too few sequences remain.

* This research was supported, in part, by the National Science Foundation under grants 0829916 and NETS 0963812. The authors acknowledge the University of Florida High-Performance Computing Center for providing computational resources and support that have contributed to the research results reported within this paper. URL: http://hpc.ufl.edu.
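The bitonic sort underlying these early GPU sorts, and the initial per-warp phase of Warpsort, can be sketched sequentially as follows (plain Python for illustration only; this is the generic sorting network, not any of the cited implementations). Its appeal on a GPU is that every compare-exchange stage is data-independent, so one thread can handle each pair in lockstep.

```python
def bitonic_sort(a):
    # In-place bitonic sorting network; len(a) must be a power of two.
    # Stages are data-independent, so on a GPU each inner loop
    # iteration could be assigned to its own thread.
    n = len(a)
    k = 2
    while k <= n:                      # size of bitonic sequences
        j = k // 2
        while j >= 1:                  # compare-exchange distance
            for i in range(n):
                l = i ^ j              # partner index for this stage
                if l > i:
                    ascending = (i & k) == 0
                    out_of_order = a[i] > a[l] if ascending else a[i] < a[l]
                    if out_of_order:
                        a[i], a[l] = a[l], a[i]
            j //= 2
        k *= 2
    return a
```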
The remaining sequences are partitioned into subsequences that can be pairwise merged independently, and finally this pairwise merging is done with each warp merging a pair of subsequences. Experimental results reported in [21] indicate that Warpsort is about 30% faster than the merge sort algorithm of [16]. Another comparison-based sort for GPUs, GPU sample sort, was developed by Leischner et al. [12]. Sample sort is reported to also be about 30% faster than the merge sort of [16], on average, when the keys are 32-bit integers. This would make sample sort competitive with Warpsort for 32-bit keys. For 64-bit keys, sample sort is twice as fast, on average, as the merge sort of [16]. Radix sort has been adapted to GPUs in [17], [22], [11], [16], [13]. Radix sort accomplishes the sort in phases, where each phase sorts on a digit of the key using, typically, either a count sort or a bucket sort. The counting to be done in each phase may be carried out using a prefix sum or scan [4] operation, which is done quite efficiently on a GPU [17]. Harris et al.'s [22] adaptation of radix sort to GPUs uses radix 2 (i.e., each phase sorts on a bit of the key) and uses the bit-split technique of [4] in each phase of the radix sort to reorder records by the bit being considered in that phase. This implementation of radix sort is available in the CUDA Data Parallel Primitives (CUDPP) library [22]. For 32-bit keys, this implementation of radix sort requires 32 phases. In each phase, expensive scatter operations to/from the global memory are made. Le Grand et al. [11] reduce the number of phases, and hence the number of expensive scatters to global memory, by using a larger radix, 2^b, for b > 1. A radix of 16, for example, reduces the number of phases from 32 to 8. The sort in each phase is done by first computing the histogram of the 2^b possible values that a digit with radix 2^b may have. Satish et al. [16]
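The histogram-plus-scan counting just described can be sketched sequentially as follows (plain Python for illustration; the cited GPU implementations perform the histogram and prefix sum in parallel and scatter through global memory). With radix 2^b, one phase histograms the current b-bit digit, takes an exclusive prefix sum of the counts to get each bucket's starting offset, and then stably scatters the keys.

```python
def radix_phase(keys, shift, b=4):
    # One phase of LSD radix sort with radix 2^b on the b-bit digit
    # at bit position `shift` (non-negative integer keys assumed).
    radix = 1 << b
    mask = radix - 1
    digit = lambda k: (k >> shift) & mask

    # 1. Histogram of the 2^b possible digit values.
    hist = [0] * radix
    for k in keys:
        hist[digit(k)] += 1

    # 2. Exclusive prefix sum (the "scan" step that maps well to GPUs)
    #    gives each bucket its starting offset in the output.
    offsets, running = [0] * radix, 0
    for d in range(radix):
        offsets[d] = running
        running += hist[d]

    # 3. Stable scatter into the output array.
    out = [0] * len(keys)
    for k in keys:
        out[offsets[digit(k)]] = k
        offsets[digit(k)] += 1
    return out

def radix_sort(keys, key_bits=32, b=4):
    # b = 4 gives radix 16, so 32-bit keys need 32/4 = 8 phases
    # instead of the 32 phases required at radix 2.
    for shift in range(0, key_bits, b):
        keys = radix_phase(keys, shift, b)
    return keys
```

Because each phase is stable, sorting digit by digit from least to most significant yields a fully sorted sequence, and raising b trades fewer phases (fewer global-memory scatters) against a larger per-phase histogram.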