IEEE TRANSACTIONS ON MAGNETICS, VOL. 46, NO. 8, AUGUST 2010 3081
The Solution of Electromagnetic Field Problems Using a Sliding Window
Gauss-Seidel Algorithm on a Multicore Processor
Hussein Moghnieh and David A. Lowther
Department of Electrical and Computer Engineering, McGill University, Montreal, QC H3A 2A7, Canada
Chip-based multicore processors (CMPs) raise the possibility of significant improvement in the performance of electromagnetic sim-
ulation tools. They can impact the mesh generation, solution, and result evaluation
phases. This paper investigates the parallelization and scalability of the
Gauss–Seidel method on CMPs, using a new cache-blocking technique to overcome
the small-cache problem together with a thread synchronization technique that
improves cache sharing and maximizes thread cycle utilization.
Index Terms—Cache blocking, chip multicore processors (CMPs), electromagnetic field problem solver, Gauss–Seidel iterative method.
I. INTRODUCTION
The computer-based simulation of an electromagnetic
field problem using a differential technique, such as finite
differences or elements, consists of several phases which are
computationally intensive and have complexities which can
be beyond linear. These include mesh generation, equation
solution, and result evaluation. The introduction of chip-based
multicore processors (CMPs), both within the main central pro-
cessing unit (CPU) and as part of high-performance graphics
systems, provides the possibility of significant speedups over
the existing single-core systems. Each phase of the simulation
system presents its own challenges to parallelization. This
paper targets the equation solution phase.
In a differential method, the equation sets produced are
large and sparse and their solution has been the subject of
considerable research over the last four decades and, for many
problems, the present algorithm of choice is the Incomplete
Cholesky–Conjugate Gradient (ICCG) method. This is, however,
a predominantly sequential algorithm optimized for a
single-processor machine.
It is not obvious that this is an ideal algorithm for the new
generation of processors and, thus, it is worth revisiting a range
of solver algorithms and re-examining their performance on
the new architectures.
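As background for the comparison, a single-threaded Gauss–Seidel solver can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation; the test matrix, tolerance, and iteration cap are hypothetical choices. The in-place use of freshly updated entries is what makes the classical method inherently sequential:

```python
import numpy as np

def gauss_seidel(A, b, tol=1e-10, max_iter=10_000):
    """Solve A x = b by Gauss-Seidel; assumes A is, e.g., diagonally dominant."""
    n = len(b)
    x = np.zeros(n)
    for _ in range(max_iter):
        x_old = x.copy()
        for i in range(n):
            # Uses the already-updated values x[0..i-1]: this in-place
            # dependency is what makes the classical sweep sequential.
            s = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            x[i] = (b[i] - s) / A[i, i]
        if np.linalg.norm(x - x_old, np.inf) < tol:
            break
    return x

# Small diagonally dominant example (hypothetical test system)
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = gauss_seidel(A, b)
```

Each component update depends on the components updated just before it, which is the dependency the parallel schemes discussed below must break.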
II. PROCESSOR ARCHITECTURE AND SOLVERS
One of the key issues in considering an algorithm for a mul-
ticore system is the architecture of the processor. This must be
considered as the given environment, and the goal is to find an
efficient algorithm for this architecture, not develop an effec-
tive architecture for a particular parallel algorithm. Most multi-
core machines have been designed to handle several relatively
small tasks in parallel, not to divide one large task amongst
the processors, which is the case for multiprocessor-based
computers.

Manuscript received December 17, 2009; accepted April 06, 2010. Current
version published July 21, 2010. Corresponding author: D. A. Lowther (e-mail:
david.lowther@mcgill.ca). Color versions of one or more of the figures in
this paper are available online at http://ieeexplore.ieee.org. Digital Object
Identifier 10.1109/TMAG.2010.2048421

Although a CMP has the advantage of low inter-thread
communication and synchronization due to cache sharing, the
cache memory is small relative to the computational power of
the cores, and this limits the amount available for each core. In
addition, the memory bandwidth between the cores and the main
system memory is relatively small. Consequently, the ideal
algorithm for this architecture is one which allows the equation
set to be partitioned among the cores and which maximizes the
reuse of the portion of the equation set held in each cache.
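The cache-blocking idea behind this can be illustrated on a 1-D Poisson system: instead of streaming once through all unknowns per iteration, each block of unknowns is swept several times in a row while it is (notionally) resident in cache, so data brought in from main memory is reused before the sweep moves on. The sketch below is single-threaded and only shows the access pattern; the block size and sweep counts are hypothetical tuning parameters, not values from the paper:

```python
import numpy as np

def blocked_gs_1d_poisson(b, block=1024, inner_sweeps=4, outer_iters=200):
    """Gauss-Seidel for the 1-D Poisson system
    -u[i-1] + 2*u[i] - u[i+1] = b[i], with zero Dirichlet boundaries.

    The unknowns are swept block by block; each block is updated
    `inner_sweeps` times in a row, so the data loaded into cache for
    that block is reused before the sweep moves on (cache blocking).
    """
    n = len(b)
    u = np.zeros(n + 2)                     # padded with boundary zeros
    for _ in range(outer_iters):
        for start in range(1, n + 1, block):
            stop = min(start + block, n + 1)
            for _ in range(inner_sweeps):
                for i in range(start, stop):
                    u[i] = 0.5 * (b[i - 1] + u[i - 1] + u[i + 1])
    return u[1:-1]
```

Choosing `block` so that the corresponding slice of the equation set fits in a core's share of the cache is the essence of the technique; the extra inner sweeps trade redundant arithmetic for fewer cache misses.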
In recent years, the high-performance computing community
has been revisiting both current and long-abandoned numerical
methods to gain performance in solving large systems of
linear equations. Numerical techniques have been adapted and
implemented to target specific emerging parallel hardware
architectures.
In this paper, we investigate the speedup of a parallel
Gauss–Seidel algorithm on a CMP. Iterative methods, such as
Jacobi and Gauss–Seidel, have been investigated on parallel
systems. The main concern has been to reduce the synchro-
nization points between processors, leading to schemes known
as chaotic relaxation [1], which are not suitable
for implementation on a CMP. In addition, the performance
of a parallel Gauss–Seidel as a multigrid smoother on CMP
has also been investigated in [2], where cache blocking [3], a
technique to reuse data in cache, was used to decrease cache
misses, hence increasing performance. It was applied in con-
junction with red-black and natural reordering techniques of
the problem—techniques used to parallelize stationary iterative
methods. However, a strong order of execution was imposed
on the threads, so that threads spent relatively more time
waiting than executing.
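Red–black reordering, mentioned above, removes the sequential dependency within a sweep: on a 5-point stencil, each "red" point depends only on "black" neighbors and vice versa, so all points of one color can be updated concurrently. A minimal single-threaded sketch of one red–black Gauss–Seidel sweep for a 2-D Poisson problem follows; it is illustrative only, not the implementation from [2], and the grid setup is hypothetical:

```python
import numpy as np

def red_black_sweep(u, f, h):
    """One red-black Gauss-Seidel sweep for -laplace(u) = f on a square grid.

    u is (n+2)x(n+2) including the zero Dirichlet boundary. Interior
    points are updated color by color; all same-colored updates are
    independent of each other, so each color phase could be split
    across the cores of a CMP.
    """
    n = u.shape[0] - 2
    for color in (0, 1):                       # 0 = red, 1 = black
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                if (i + j) % 2 == color:
                    u[i, j] = 0.25 * (u[i - 1, j] + u[i + 1, j]
                                      + u[i, j - 1] + u[i, j + 1]
                                      + h * h * f[i, j])
    return u
```

The price of this parallelism is the barrier between the two color phases, which is exactly the kind of imposed execution order that can leave threads idle.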
It appears that, for larger problems, the gap between slow
data access due to the small cache and the high number of
available flops is wider on CMPs than on other hardware
architectures. Therefore, efficient cache management is critical
to achieve better performance. Cached data reuse and fair
cache sharing among threads are essential. For this reason,
a synchronized data-pipelining threading technique [4] (i.e.,
producer-consumer model) is used to provide better commu-
nication and synchronization between threads and to provide
fair cache sharing and partitions between cores of the CMP.
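The producer–consumer (data-pipelining) pattern referred to here can be sketched with standard Python threads and a bounded queue. The stage functions and queue depth below are hypothetical; the point of the sketch is that the bounded queue itself provides the synchronization between pipeline stages and keeps the working set shared by the two threads small, loosely analogous to fair cache sharing between cores:

```python
import queue
import threading

def run_pipeline(items, stage1, stage2, depth=4):
    """Two-stage producer-consumer pipeline.

    The producer thread applies stage1 and pushes results into a
    bounded queue; the consumer thread applies stage2 as results
    arrive. A full queue blocks the producer, so neither stage can
    run far ahead of the other.
    """
    q = queue.Queue(maxsize=depth)
    results = []

    def producer():
        for item in items:
            q.put(stage1(item))    # blocks when the queue is full
        q.put(None)                # sentinel: no more work

    def consumer():
        while True:
            item = q.get()         # blocks when the queue is empty
            if item is None:
                break
            results.append(stage2(item))

    t1 = threading.Thread(target=producer)
    t2 = threading.Thread(target=consumer)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results

out = run_pipeline(range(5), lambda x: x + 1, lambda x: x * 2)
```

With one producer and one consumer, the FIFO queue also preserves the order of the work items, so no further synchronization is needed between the stages.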
The data-pipelining programming model for parallel iterative
0018-9464/$26.00 © 2010 IEEE