IEEE TRANSACTIONS ON MAGNETICS, VOL. 46, NO. 8, AUGUST 2010

The Solution of Electromagnetic Field Problems Using a Sliding Window Gauss-Seidel Algorithm on a Multicore Processor

Hussein Moghnieh and David A. Lowther

Department of Electrical and Computer Engineering, McGill University, Montreal, QC H3A 2A7, Canada

Chip-based multicore processors (CMPs) raise the possibility of significant improvements in the performance of electromagnetic simulation tools. They can impact the mesh generation, solution, and result evaluation phases. This paper investigates the parallelization and scalability of Gauss–Seidel on CMPs, using a new cache blocking technique to overcome the small-cache problem together with a thread synchronization technique for better cache sharing and to maximize thread cycle utilization.

Index Terms—Cache blocking, chip multicore processors (CMPs), electromagnetic field problem solver, Gauss–Seidel iterative method.

I. INTRODUCTION

THE computer-based simulation of an electromagnetic field problem using a differential technique, such as finite differences or finite elements, consists of several phases which are computationally intensive and have complexities which can be beyond linear. These include mesh generation, equation solution, and result evaluation. The introduction of chip-based multicore processors (CMPs), both within the main central processing unit (CPU) and as part of high-performance graphics systems, provides the possibility of significant speedups over existing single-core systems. Each phase of the simulation system presents its own challenges to parallelization. This paper targets the equation solution phase. In a differential method, the equation sets produced are large and sparse. Their solution has been the subject of considerable research over the last four decades and, for many problems, the present algorithm of choice is the Incomplete Cholesky-Conjugate Gradient approach, which has a complexity of approximately .
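For readers unfamiliar with the baseline method the paper parallelizes, the sequential Gauss–Seidel iteration can be sketched as follows. This is a generic textbook version, not the paper's sliding window variant; the function name, tolerance, and iteration cap are illustrative.

```python
# Sequential Gauss-Seidel sketch for a small dense system Ax = b.
# A must be (e.g.) strictly diagonally dominant for convergence.

def gauss_seidel(A, b, tol=1e-10, max_iters=10_000):
    n = len(b)
    x = [0.0] * n
    for _ in range(max_iters):
        max_delta = 0.0
        for i in range(n):
            # Use the already-updated x[j] for j < i; this in-place
            # update is what distinguishes Gauss-Seidel from Jacobi,
            # and is also what makes it inherently sequential.
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            new_xi = (b[i] - s) / A[i][i]
            max_delta = max(max_delta, abs(new_xi - x[i]))
            x[i] = new_xi
        if max_delta < tol:
            break
    return x
```

Because each unknown depends on the freshly updated values before it, a straightforward sweep cannot be split across cores without reordering or synchronization, which is the difficulty the rest of the paper addresses.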
This is, however, a predominantly sequential algorithm optimized for a single-processor machine. It is not obvious that it is ideal for the new generation of processors, and it is therefore worth revisiting a range of solver algorithms and re-examining their performance on the new architectures.

II. PROCESSOR ARCHITECTURE AND SOLVERS

One of the key issues in considering an algorithm for a multicore system is the architecture of the processor. This must be treated as the given environment: the goal is to find an efficient algorithm for this architecture, not to develop an effective architecture for a particular parallel algorithm. Most multicore machines have been designed to handle several relatively small tasks in parallel, not to divide one large task amongst the processors, which is the case for multiprocessor-based computers. Although the CMP has the advantage of low inter-thread communication and synchronization cost due to cache sharing, the cache memory is small relative to the computational power of the cores, and this limits the amount available to each core. In addition, the memory bandwidth between the cores and the main system memory is relatively small. Consequently, the ideal algorithm for this architecture is one which allows the equation set to be broken up between the cores and maximizes the utilization of the portion of the equation set present in the cache.

In recent years, the high-performance computing community has been revisiting both current and long-abandoned numerical methods to gain performance in solving large systems of linear equations.

Manuscript received December 17, 2009; accepted April 06, 2010. Current version published July 21, 2010. Corresponding author: D. A. Lowther (e-mail: david.lowther@mcgill.ca). Digital Object Identifier 10.1109/TMAG.2010.2048421
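The requirement stated above, that the data being worked on stay resident in each core's small cache, is the motivation for cache blocking. The generic pattern can be sketched on a simple matrix transpose; the tile size and function name are illustrative, and this is not the paper's sliding window scheme.

```python
# Cache-blocking sketch: traverse a 2D array in BLOCK x BLOCK tiles
# so that each tile is fully reused while it is resident in cache,
# rather than streaming across whole rows and evicting data early.

BLOCK = 4  # in practice chosen so a tile fits in the per-core cache

def blocked_transpose(A):
    n = len(A)
    T = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, BLOCK):
        for jj in range(0, n, BLOCK):
            # The two inner loops touch only one BLOCK x BLOCK tile
            # of A and one tile of T before moving on.
            for i in range(ii, min(ii + BLOCK, n)):
                for j in range(jj, min(jj + BLOCK, n)):
                    T[j][i] = A[i][j]
    return T
```

The same tiling idea, applied to the unknowns of a sparse system rather than to a dense array, is what lets an iterative solver keep its working set inside the limited cache of a CMP core.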
Numerical techniques have been tweaked and implemented to target specific emerging parallel hardware architectures. In this paper, we investigate the speedup of a parallel Gauss–Seidel algorithm on a CMP.

Iterative methods, such as Jacobi and Gauss–Seidel, have been investigated on parallel systems. The main concern has been to reduce the number of synchronization points between processors, leading to schemes known as chaotic or relaxation schemes [1], which are not suitable for implementation on a CMP. In addition, the performance of a parallel Gauss–Seidel as a multigrid smoother on a CMP has been investigated in [2], where cache blocking [3], a technique for reusing data in cache, was used to decrease cache misses and hence increase performance. It was applied in conjunction with red-black and natural reordering of the problem, techniques used to parallelize stationary iterative methods. However, a strong order of execution was imposed on the threads, leading to low thread execution time relative to thread waiting time.

It appears that, for larger problems, the gap between slow data access, caused by the small cache, and the high number of flops available is wider on a CMP than on other hardware architectures. Therefore, efficient cache management is critical to achieving better performance: cached data reuse and fair cache sharing among threads are essential. For this reason, a synchronized data-pipelining threading technique [4] (i.e., a producer-consumer model) is used to provide better communication and synchronization between threads and to provide fair cache sharing and partitioning between the cores of the CMP. The data-pipelining programming model for parallel iterative

0018-9464/$26.00 © 2010 IEEE
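The producer-consumer model referred to above can be sketched with two synchronized threads sharing a bounded queue. This is a minimal illustration of the data-pipelining idea, not the paper's implementation; the stage functions, queue capacity, and names are assumptions.

```python
# Data-pipelining (producer-consumer) sketch: two threads form a
# pipeline, and the bounded queue is their only synchronization point.
import queue
import threading

def pipeline(items, stage1, stage2, capacity=4):
    """Run stage1 (producer) and stage2 (consumer) concurrently."""
    q = queue.Queue(maxsize=capacity)
    results = []
    SENTINEL = object()  # marks the end of the stream

    def producer():
        for item in items:
            q.put(stage1(item))  # blocks when the consumer falls behind
        q.put(SENTINEL)

    def consumer():
        while True:
            item = q.get()
            if item is SENTINEL:
                break
            results.append(stage2(item))

    t1 = threading.Thread(target=producer)
    t2 = threading.Thread(target=consumer)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results
```

The bounded queue keeps the two stages loosely coupled: neither thread imposes a strict execution order on the other beyond the queue's capacity, which is the property the paper exploits to reduce thread waiting time relative to the rigid orderings of earlier schemes.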