Loop Parallelisation for the Jikes RVM

Jisheng Zhao, Dr. Ian Rogers, Dr. Chris Kirkham, Prof. Ian Watson
The University of Manchester
{jisheng.zhao, ian.rogers, christopher.kirkham, ian.watson}@manchester.ac.uk

Abstract

Increasing the number of instructions executing in parallel has helped improve processor performance, but the technique is limited. Executing code on parallel threads and processors has fewer limitations, but most computer programs tend to be serial in nature. This paper presents a compiler optimisation that parallelises code at run-time inside a JVM and thereby increases the number of threads. We show SPEC JVM benchmark results for this optimisation. On a current desktop processor the parallelised code runs slower than the serial version because of thread creation costs, but with those costs removed the performance is better than the serial code. We measure the threading costs and discuss how a future computer architecture will enable this optimisation to become feasible for exploiting thread parallelism instead of instruction and/or vector parallelism.

1. Introduction

Parallelising compilers, sometimes known as supercompilers, will soon be commonplace with GCC 4, where support comes in the form of vectorisation [12]. Vectorisation produces code for the single-instruction multiple-data (SIMD) instruction sets of modern multimedia architectures. An alternative to vectorisation is thread-based parallelisation: threads can execute on separate computing resources and hence in parallel with each other. Vectorisation requires analysis of inner loops to determine whether they can be made to work on several data operands at the same time. Thread parallelisation can work on any loop, but the cost of thread creation and completion detection means it is best performed on outer loops [9, 16]. Outer loops invariably have more complex data dependencies within them, and thus detecting loops to parallelise often isn't possible.
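To illustrate the distinction drawn above, consider the following sketch (our illustration, not taken from the paper): the first loop has no loop-carried dependency and is amenable to both vectorisation and threading, while the second carries a dependency between iterations and cannot be split.

```java
// Illustrative sketch: two loops with and without loop-carried dependencies.
public class LoopKinds {
    // DOALL-amenable: each iteration touches a distinct element,
    // so a vectorising or parallelising compiler may split it freely.
    static void scale(double[] a, double k) {
        for (int i = 0; i < a.length; i++) {
            a[i] = a[i] * k;          // no loop-carried dependency
        }
    }

    // Not DOALL-amenable: iteration i reads the value written by
    // iteration i-1, so the iterations must run in order.
    static void prefixSum(double[] a) {
        for (int i = 1; i < a.length; i++) {
            a[i] = a[i] + a[i - 1];   // loop-carried dependency on a[i-1]
        }
    }
}
```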
Consequently, current microprocessors offer good vector support but only limited support for the threading model.

In this work we take the view that vectorisation, like ILP, will become limited, and that a better compromise in the design balance may be more parallel execution units (i.e. more cores in a chip multi-processor) rather than extended hardware support for vectors and ILP. We present a new implementation of the well-known and simple DOALL parallelisation compiler optimisation [9, 16]. Uniquely for our system, we implement it in the environment of a JVM, allowing parallelisation to occur at run-time and, compared to prior JVM-oriented systems, without the involvement of the programmer.

Our work assumes that all loops are best parallelised into threads, a technique which, as described above, is overly eager for current microprocessors. However, if the cost of maintaining a thread were brought close to the cost of farming work out to vector processing units, then we believe threads are to be preferred to vectors, as the parallel execution units are general purpose. This is the model currently proposed by our Jamaica architecture [1]. Ultimately, however, the best design may be a compromise of vector and parallel resources in a heterogeneous chip-multiprocessor environment.

This paper is split into five further sections. Section 2 describes an initial set of loop optimisations used to make Java bytecode amenable to parallelisation. Section 3 describes the loop parallelisation optimisation itself. In Section 4 we analyse thread creation and completion costs, and then measure our performance, both on current hardware and with thread costs removed. Section 5 discusses run-time parallelisation, as demonstrated in this paper, motivating future computer architecture and compiler research. Section 6 concludes the paper.

2. Making Java Loops Parallelisable

Existing work such as javar, High Performance Fortran and OpenMP allows programmers to express which loops are amenable to thread parallelisation, so that the compiler can forgo dependence analysis [3, 7, 4]. Substantial research has shown how programs can be transformed by the compiler to better exploit available parallelism. This work has included automatic parallelisation, which needs neither programmer intervention nor compiler constructs implying that dependencies can be ignored. GCC 4 fits this criterion, as do Intel's Fortran/C compiler and Matlab*P [12, 2, 6].

These previous solutions have looked at statically determining parallelism and then creating binaries to utilise it. We propose using similar techniques in the dynamic compilation environment of Java, specifically the Jikes Research Virtual Machine (RVM) [8]. Dynamic compilation enables run-time feedback to guide where optimisation should be performed.

The loops we are looking for are the simplest form to parallelise: DOALL-amenable loops. These loops have no loop-carried dependencies. So we could consider parallelising an array fill routine such as this one from the GNU Classpath library's java.util.Arrays implementation:

    for (int i = fromIndex; i < toIndex; i++)
        a[i] = val;

Figure 1. DOALL-amenable memory assignment loop

Java adds a complication to dependency analysis: exceptions. When the loop above is in the internal form of the Jikes RVM, guards are added that capture these dependencies and implicitly add exception edges to the control-flow graph (CFG).

Proceedings of the Sixth International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT'05) 0-7695-2405-2/05 $20.00 © 2005 IEEE
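A sketch of what DOALL parallelisation of the Figure 1 fill loop might produce is shown below. This is our own illustration using plain java.lang.Thread, not the Jikes RVM's internal thread machinery: the iteration space is split into contiguous chunks, one per worker, and a join waits for completion. The thread start and join calls are exactly where the creation and completion-detection costs discussed above are paid.

```java
// Hypothetical sketch of DOALL parallelisation of the fill loop:
// split [fromIndex, toIndex) into nThreads contiguous chunks.
public class ParallelFill {
    static void fill(int[] a, int fromIndex, int toIndex,
                     int val, int nThreads) throws InterruptedException {
        Thread[] workers = new Thread[nThreads];
        int span = toIndex - fromIndex;
        for (int t = 0; t < nThreads; t++) {
            final int lo = fromIndex + (span * t) / nThreads;
            final int hi = fromIndex + (span * (t + 1)) / nThreads;
            workers[t] = new Thread(() -> {
                for (int i = lo; i < hi; i++) {
                    a[i] = val;   // distinct elements per chunk: no data race
                }
            });
            workers[t].start();   // thread creation cost paid here
        }
        for (Thread w : workers) {
            w.join();             // completion-detection cost paid here
        }
    }
}
```

Because the loop has no loop-carried dependencies, each chunk can be filled independently; correctness does not depend on the order in which the workers run.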