Job Parallelism using Graphical Processing Unit Individual Multi-Processors and Localised Memory

D.P. Playne and K.A. Hawick
Computer Science, Massey University, North Shore 102-904, Auckland, New Zealand
d.p.playne@massey.ac.nz, k.a.hawick@massey.ac.nz
Tel: +64 9 414 0800 Fax: +64 9 441 8181

April 2013

Abstract

Graphical Processing Units (GPUs) are usually programmed to provide data-parallel acceleration to a host processor. Modern GPUs typically have an internal multi-processor (MP) structure that can be exploited in an unusual way to offer semi-independent task parallelism, providing the MPs can operate within their own localised memory and apply data-parallelism to their own problem subset. We describe a combined simulation and statistical analysis application using component labelling and benchmark it on a range of modern GPU and CPU devices with various numbers of cores. As well as demonstrating a high degree of job parallelism and throughput, we find that a typical GPU MP outperforms a conventional CPU core.

Keywords: GPU; task parallelism; data parallelism; hybrid parallelism; multi-processor.

1 Introduction

A great deal of the present research and development effort going into processor development is in increasing the number of cores that can be used concurrently on a single processor chip package. At the time of writing there are two complementary approaches being adopted. The first is the addition of high-capability central processing unit (CPU) cores, where each core presents computational capabilities to the applications programmer that individually appear very much the same as those of a traditional single-core CPU. This approach is very much linked to the processor product development path taken by Intel and AMD and at the time of writing is typified by devices with 4, 6, or 8 cores, with recent devices announced fielding 16 and 32 such cores. The other approach is that typified by the GPU devices fielded by companies like NVidia.
To a large extent the recent success of GPUs for general-purpose (non-graphical) programming has been due to the data-parallelism possibilities offered by the large and rapidly growing number of simpler compute cores available. Recent GPUs have fielded 512 and 1536 such cores.

In this paper we explore the idea that one can also program GPUs in a manner closer to that of the traditional CPU core by focusing on the streaming multi-processors (MPs) and the resources available to them. A modern GPU has a broadly similar number of MPs to the number of compute cores on a modern CPU. In this respect it appears that vendors like Intel and NVidia are approaching the same problem but from different directions. This has interesting implications for future and hybrid devices.

We are interested in how one can use GPU MPs using a job-parallelism approach. In separate work we have explored how jobs can be placed on completely separate GPU accelerators run by the same CPU host program, but in this present paper we explore independent jobs running on the MPs of a single GPU. There are a number of appropriate simulation models for which this is a powerful paradigm for enhancing throughput. We present performance analysis based on an example such as simulating the 2D Game of Life (GoL) cellular automaton (CA) model [5, 7], but we also incorporate a sophisticated model analysis algorithm using cluster component labelling and histogramming [11].

Component labelling [6, 22] is a long-standing problem of interest on parallel computers, with a range of parallel approaches reported in the literature [19]. We have reported prior work of our own in achieving fast component labelling on a single GPU [8] where memory was not at a premium. In this present paper we report our new work in achieving component labelling performed within the memory resources of a single MP.
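To illustrate the analysis step being discussed, the following is a minimal serial C sketch of cluster component labelling with size histogramming, using a union-find forest over cell indices. It is an illustration of the technique only, not the GPU implementation of [8] or the in-MP variant reported here; the names `label_and_histogram`, `find` and `unite`, and the fixed lattice size `N`, are choices made for this sketch.

```c
#include <string.h>

#define N 8                       /* lattice edge length for this sketch */

static int parent[N * N];         /* union-find forest over cell indices */

static int find(int i)            /* root lookup with path halving */
{
    while (parent[i] != i) {
        parent[i] = parent[parent[i]];
        i = parent[i];
    }
    return i;
}

static void unite(int a, int b)   /* merge the clusters holding a and b */
{
    a = find(a);
    b = find(b);
    if (a != b)
        parent[b] = a;
}

/* Label the 4-connected clusters of live (non-zero) cells, fill
 * hist[s] with the number of clusters of size s, and return the
 * number of distinct clusters found. */
int label_and_histogram(const int grid[N][N], int hist[N * N + 1])
{
    int size[N * N];
    int clusters = 0;

    for (int i = 0; i < N * N; i++)
        parent[i] = i;
    memset(hist, 0, (N * N + 1) * sizeof(int));
    memset(size, 0, sizeof(size));

    /* one sweep, merging each live cell with its east/south neighbours */
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++) {
            if (!grid[y][x]) continue;
            if (x + 1 < N && grid[y][x + 1]) unite(y * N + x, y * N + x + 1);
            if (y + 1 < N && grid[y + 1][x]) unite(y * N + x, (y + 1) * N + x);
        }

    /* accumulate cluster sizes at each root, then histogram them */
    for (int i = 0; i < N * N; i++)
        if (grid[i / N][i % N])
            size[find(i)]++;
    for (int i = 0; i < N * N; i++)
        if (size[i] > 0) {
            hist[size[i]]++;
            clusters++;
        }
    return clusters;
}
```

A single sweep with only east and south merges suffices because union-find makes cluster membership transitive; the memory footprint is one integer per cell plus the histogram, which is the kind of budget that must fit within a single MP's local memory in the scheme described in this paper.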
Using bit-wise programming instructions and data structures we are able to cram a combined simulation model and its statistical component-analysis software into the memory of individual MPs. The hosting CPU is thus able to manage independent jobs across all the MPs of its accelerating GPU device, and this scheme can be extended further if more than one GPU device is available.

The problem of managing job parallelism on modern multi-core devices is not a new one, and much has been shown to depend on the efficiency and ease of programming with multiple threads of control [2, 3, 9, 10, 17]. These technologies are