CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2012; 24:463–480
Published online 2 October 2011 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/cpe.1848
SPECIAL ISSUE PAPER
Compiler and runtime support for enabling reduction
computations on heterogeneous systems
Vignesh T. Ravi, Wenjing Ma, David Chiu and Gagan Agrawal*,†
Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA
SUMMARY
A trend that has materialized, and has attracted much attention, is the increasing heterogeneity of computing platforms. It is now very common for a desktop or a notebook computer to come equipped with both a multi-core CPU and a graphics processing unit (GPU). Capitalizing on the maximum
computational power of such architectures (i.e., by simultaneously exploiting both the multi-core CPU and
the GPU), starting from a high-level API, is a critical challenge. We believe that it would be highly desirable
to support a simple way for programmers to realize the full potential of today’s heterogeneous machines.
This paper describes a compiler and runtime framework that can map a class of applications, namely those
characterized by generalized reductions, to a system with a multi-core CPU and GPU. Starting with simple
C functions with added annotations, we automatically generate the middleware API code for the multi-core,
as well as CUDA code to exploit the GPU simultaneously. The runtime system provides efficient schemes
for dynamically partitioning the work between CPU cores and the GPU. Our experimental results from two applications, k-means clustering and principal component analysis, show that, by effectively harnessing the heterogeneous architecture, we can achieve significantly higher performance than with either the GPU or the multi-core CPU alone. In k-means clustering, the heterogeneous version with eight CPU cores and a GPU achieved a speedup of about 32.09x relative to the one-thread CPU version. Compared with the faster of the CPU-only and GPU-only executions, it achieved a performance gain of about 60%. In principal component analysis, the heterogeneous version attained a speedup of 10.4x relative to the one-thread CPU version. Compared with the faster of the CPU-only and GPU-only versions,
the heterogeneous version achieved a performance gain of about 63.8%. Copyright © 2011 John Wiley &
Sons, Ltd.
Received 6 February 2011; Revised 20 June 2011; Accepted 28 July 2011
1. INTRODUCTION
The traditional method of improving processor performance, that is, increasing clock frequencies, has become physically infeasible. To help offset this limitation, multi-core CPU and many-core
graphics processing unit (GPU) architectures have emerged as a cost-effective means for scaling
performance. This, however, has created a programmability challenge. On the one hand, a large
body of research has been focused on the effective utilization of multi-core CPUs. For instance,
library support and programming models are being developed for efficient programming on multi-core platforms [1, 2]. On the other hand, researchers have also been seeking ways to unleash the power of the GPU for general-purpose computing [3–6]. Although a variety of applications have been successfully mapped to GPUs, programming them remains a challenging task. For example, Nvidia's Compute Unified Device Architecture (CUDA) [23], the most
widely used computing architecture for GPUs to date, requires low-level programming and manual
memory management.
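The class of computations targeted here, generalized reductions, can be illustrated with a minimal sketch in plain C (the histogram computation, function name, and data layout below are illustrative assumptions, not taken from this paper):

```c
#include <stddef.h>

/* Generalized reduction: each input element updates a small reduction
 * object (here, an array of bin counts) through a commutative and
 * associative operation. */
void histogram(const int *data, size_t n, int nbins, int *bins) {
    for (size_t i = 0; i < n; i++) {
        int b = data[i] % nbins;   /* map the element to a reduction slot */
        bins[b] += 1;              /* commutative/associative update      */
    }
}
```

Because the per-element updates commute and associate, the iteration space can be partitioned into chunks and processed independently by CPU cores and the GPU, with the partial reduction objects merged at the end; this property is what makes such loops amenable to the kind of automatic mapping and dynamic work partitioning described above.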
*Correspondence to: Gagan Agrawal, Department of Computer Science and Engineering, The Ohio State University,
Columbus, OH 43210, USA.
†E-mail: agrawal@cse.ohio-state.edu