Loop Parallelisation for the Jikes RVM
Jisheng Zhao, Dr. Ian Rogers, Dr. Chris Kirkham, Prof. Ian Watson
The University of Manchester
{jisheng.zhao, ian.rogers, christopher.kirkham, ian.watson}@manchester.ac.uk
Abstract
Increasing the number of instructions executing in parallel has helped improve processor performance, but the technique is limited. Executing code on parallel threads and processors has fewer limitations, but most computer programs tend to be serial in nature. This paper presents a compiler optimisation that parallelises code at run-time inside a JVM and thereby increases the number of threads. We show Spec JVM benchmark results for this optimisation. The performance on a current desktop processor is slower than without parallel threads, owing to thread creation costs, but with these costs removed the performance is better than that of the serial code. We measure the threading costs and discuss how a future computer architecture will make this optimisation feasible, exploiting thread-level rather than instruction-level and/or vector parallelism.
1. Introduction
Parallelising compilers, sometimes known as supercompilers, will soon be commonplace with GCC 4, where support comes in the form of vectorisation [12]. Vectorisation is used to produce code for the single-instruction multiple-data (SIMD) instruction sets of modern multimedia architectures. An alternative to vectorisation is thread-based parallelisation. Threads can execute on separate computing resources and hence in parallel with each other.
Vectorisation requires analysis of inner loops to determine whether they can be made to operate on several data operands at the same time. Parallelisation can work on any loop, but the cost of thread creation and completion detection means it is best performed on outer loops [9, 16]. Outer loops invariably have more complex data dependencies within them, and thus detecting loops to parallelise often isn't possible. Consequently, current microprocessors offer good vector support but only limited support for the threading model.
In this work we take the view that vectorisation, as with ILP, will become limited and that a better compromise in the design balance may be more parallel execution units (i.e. more cores in a chip multi-processor) rather than extended hardware support for vectors and ILP. We present a new implementation of the well-known and simple DOALL parallelisation compiler optimisation [9, 16]. Uniquely for our system, we implement this in the environment of a JVM, allowing parallelisation to occur at run-time and, compared to prior JVM-oriented systems, without the involvement of the programmer.
Our work assumes that all loops are best parallelised into threads, a technique which, as described above, is overly eager for current microprocessors. However, if the cost of maintaining a thread were brought close to the cost of farming work out to vector processing units, then we believe threads are to be preferred to vectors, as the parallel execution units are general purpose. This is the model currently proposed by our Jamaica architecture [1]. It may ultimately be that the best situation is a compromise of vector and parallel resources in a heterogeneous chip multiprocessor environment.
This paper is split into five further sections. Section 2 describes an initial set of loop optimisations used to make Java bytecode amenable to parallelisation. Section 3 describes the loop parallelisation optimisation itself. We analyse thread creation and completion costs, and then measure the performance of our system, both with current thread costs and with those costs removed, in Section 4. Section 5 discusses run-time parallelisation, as demonstrated in this paper, motivating future computer architecture and compiler research. Section 6 concludes the paper.
2. Making Java Loops Parallelisable
Existing work such as javar, High Performance Fortran and OpenMP has allowed programmers to express which loops are amenable to thread parallelisation, so that the compiler can forgo dependence analysis [3, 7, 4]. Substantial research has shown how the compiler can transform programs to better exploit available parallelism. This work includes automatic parallelisation, which needs no programmer intervention or compiler constructs implying that dependencies can be ignored. GCC 4 fits this criterion, as do Intel's Fortran/C compilers and Matlab*P [12, 2, 6].
These previous solutions have looked at statically determin-
ing parallelism and then creating binaries to utilise it. We
propose using similar techniques in the dynamic compilation
environment of Java - specifically the Jikes Research Virtual
Machine (RVM) [8]. Dynamic compilation enables run-time
feedback to guide where optimisation could be performed.
The loops we are looking for are the simplest form to parallelise: DOALL-amenable loops. These loops have no loop-carried dependencies. We could therefore consider parallelising an array fill routine such as this one from the GNU Classpath library's java.util.Arrays implementation:
for (int i = fromIndex; i < toIndex; i++)
    a[i] = val;
Figure 1. DOALL-amenable memory assignment loop
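To make the DOALL transformation concrete, the following is a hand-written sketch of the kind of code it conceptually produces for the loop of Figure 1: the iteration space [fromIndex, toIndex) is split into one contiguous chunk per worker thread, and completion is detected by joining every worker. The class and method names are ours for illustration; this is not the code the Jikes RVM actually generates.

```java
// Illustrative sketch of a DOALL-parallelised array fill.
public class ParallelFill {
    static void parallelFill(int[] a, int fromIndex, int toIndex,
                             int val, int nThreads) {
        Thread[] workers = new Thread[nThreads];
        int total = toIndex - fromIndex;
        for (int t = 0; t < nThreads; t++) {
            // Each worker gets a disjoint chunk of the iteration space,
            // which is safe because the loop has no carried dependencies.
            final int lo = fromIndex + (int) ((long) total * t / nThreads);
            final int hi = fromIndex + (int) ((long) total * (t + 1) / nThreads);
            workers[t] = new Thread(() -> {
                for (int i = lo; i < hi; i++)
                    a[i] = val;
            });
            workers[t].start();   // thread creation cost paid here
        }
        for (Thread w : workers) {
            try {
                w.join();         // completion detection cost paid here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

The per-call thread creation and join in this sketch are exactly the overheads measured in Section 4; a production system would amortise them with a pool of worker threads.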
Java adds a complication to dependency analysis: exceptions. When the loop above is in the internal form of the Jikes RVM, guards are added that capture these dependencies and implicitly add exception edges to the control-flow graph (CFG).
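As an illustration, the checks that these guards represent can be written out at the source level. The JVM mandates a null check and an array-bounds check before every array store; the expansion below is our own hypothetical rendering of them as explicit code, not the Jikes RVM's internal representation, and each check corresponds to an exception edge in the CFG.

```java
// Hypothetical source-level expansion of the implicit guards on the
// array store of Figure 1. In the Jikes RVM IR these appear as
// explicit null-check and bounds-check guard instructions.
public class GuardedFill {
    static void fill(int[] a, int fromIndex, int toIndex, int val) {
        for (int i = fromIndex; i < toIndex; i++) {
            // null-check guard: exception edge if 'a' is null
            if (a == null)
                throw new NullPointerException();
            // bounds-check guard: exception edge if 'i' is out of range
            if (i < 0 || i >= a.length)
                throw new ArrayIndexOutOfBoundsException(i);
            a[i] = val;  // the store itself is now provably exception-free
        }
    }
}
```

Because an exception must appear to be thrown on exactly the iteration that caused it, these guards introduce an ordering constraint that a parallelising compiler must either prove unreachable or preserve.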
Proceedings of the Sixth International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT’05)
0-7695-2405-2/05 $20.00 © 2005 IEEE