Abstract— GPUs are emerging as general-purpose high-performance computing devices. Growing GPGPU research has made numerous GPGPU workloads available. However, a systematic approach to characterizing these benchmarks and analyzing their implications for GPU microarchitecture design evaluation is still lacking. In this research, we propose a set of microarchitecture-agnostic GPGPU workload characteristics that represent the workloads in a microarchitecture-independent space. A correlated dimensionality reduction process and clustering analysis are used to understand these workloads. In addition, we propose a set of evaluation metrics to accurately evaluate the GPGPU design space. With a growing number of GPGPU workloads, this approach to analysis provides meaningful, accurate, and thorough simulation for a proposed GPU architecture design choice. Architects also benefit by choosing a set of workloads that stress their intended functional block of the GPU microarchitecture. We present a diversity analysis of GPU benchmark suites such as Nvidia CUDA SDK, Parboil, and Rodinia. Our results show that with a large number of diverse kernels, workloads such as Similarity Score, Parallel Reduction, and Scan of Large Arrays show diverse characteristics in different workload spaces. We have also explored diversity in different workload subspaces (e.g., memory coalescing and branch divergence). The Similarity Score, Scan of Large Arrays, MUMmerGPU, Hybrid Sort, and Nearest Neighbor workloads exhibit relatively large variation in branch divergence characteristics compared to the others. Memory coalescing behavior is diverse in Scan of Large Arrays, K-Means, Similarity Score, and Parallel Reduction.

I. INTRODUCTION

With the increasing number of cores per CPU chip, the performance of microprocessors has increased tremendously over the past few years. However, data-level parallelism is still not well exploited by general-purpose chip multiprocessors for a given chip area and power budget.
With hundreds of in-order cores per chip, GPUs provide high throughput on data-parallel and computation-intensive applications. Therefore, a heterogeneous microarchitecture consisting of chip multiprocessors and GPUs seems to be a good choice for data-parallel algorithms. The Nvidia CUDA™ [19], AMD stream™ [23], and OpenCL [49] programming abstractions have given data-parallel application development a boost by significantly reducing development effort. Emerging CPU-GPU heterogeneous multi-core processing has motivated the computer architecture research community to study various GPU microarchitectural designs, optimizations, and analyses, such as an efficient GPU on-chip interconnect arbitration scheme [1], more efficient GPU SIMD branch execution mechanisms [2], and a technique that diverges on a memory miss to better tolerate memory latencies in SIMD cores [4]. However, it is largely unknown whether the currently available GPGPU workloads are capable of evaluating the whole design space.

An ideal evaluation mechanism must be accurate, thorough, and realistic. Accuracy means that the deduced conclusions are broadly applicable and minimally affected by the chosen benchmarks. A thorough evaluation mechanism covers a large number of diverse benchmarks, where each workload stresses different aspects of the design. A realistic evaluation ensures that thoroughness can be achieved with fewer simulations: it narrows the workload simulation space (and with it the simulation time) while keeping the simulation fidelity and the resulting conclusions within an acceptable threshold. To achieve these goals, we propose a set of GPU microarchitecture-agnostic GPGPU workload characteristics to accurately capture workload behavior and use a wide range of metrics to evaluate the effectiveness of our characterization.
We employ principal component analysis [7] and clustering analysis methods [8], which have been shown [9, 10, 11, 12] to be effective in analyzing benchmark suites such as SPEC CPU2000 [13], SPEC CPU2006 [51], MediaBench [14], MiBench [15], SPLASH-2 [16], STAMP [17], and PARSEC [18].

This paper makes the following contributions:
• We propose a set of GPGPU workload characterization metrics. Using 38×6 design points, we show that these metrics are independent of the underlying GPU microarchitecture. These metrics will allow GPGPU researchers to evaluate the performance of emerging GPU microarchitectures regardless of their microarchitectural improvements. Although we characterize the GPGPU workloads using the Nvidia GPU microarchitecture [19], the conclusions drawn here are mostly applicable to other GPU microarchitectures such as AMD ATI [23].
• Using the proposed GPGPU workload metrics, we study the similarities between existing GPGPU kernels and observe that they often stress the same bottlenecks. We show that removing this redundancy can significantly reduce simulation time.
• We provide a workload categorization based on various workload subspaces, such as divergence characteristics, kernel characteristics, and memory coalescing. We rank the workload characteristics according to their importance. We also show that the available workload space is most diverse in terms of branch divergence characteristics and least diverse in terms of thread-batch-level coalescing behavior. The relative diversity among the Nvidia CUDA SDK [28], Parboil [26], and Rodinia [25] benchmark suites is also explored.

The rest of this paper is organized as follows. Section II provides background on GPU microarchitecture and the CUDA [19] programming model. Section III describes the proposed GPU kernel characterization metrics as well as the statistical methods used for data analysis.
Exploring GPGPU Workloads: Characterization Methodology, Analysis and Microarchitecture Evaluation Implications
Nilanjan Goswami, Ramkumar Shankar, Madhura Joshi, and Tao Li
Intelligent Design of Efficient Architecture Lab (IDEAL)
University of Florida, Gainesville, Florida, USA
{nil, sramkumar, mjoshi}@ufl.edu, taoli@ece.ufl.edu

Section IV describes our experimental