Parallel Computation of Sparse Linear Systems on Many-Core Processors
Abu Sayed Md. Mostafizur Rahaman, Jesmin Akhter¹ and Mohammad Touhidur Rahman²
Department of Computer Science and Engineering, Jahangirnagar University, Savar, Dhaka, Bangladesh
¹Institute of Information Technology, Jahangirnagar University, Savar, Dhaka, Bangladesh
²Daffodil Institute of IT, Banani Branch, Dhaka, Bangladesh
chyonecs@yahoo.com, togorcse@yahoo.com and rahmantm@gmail.com
ABSTRACT
In this paper, the authors describe a parallel
implementation of the conjugate gradient method on a
many-core system, specifically for solving sparse linear
systems. The new implementation differs from the one
applied earlier [1] in that it uses a special scheme for
storing sparse coefficient matrices: only non-zero
elements are stored and taken into account during
computations, so that the sparsity of the coefficient
matrix is exploited in full. Finally, the speedup of the
parallel algorithm is examined for different sparse
coefficient matrices arising from different physical
problems.
Keywords: Conjugate gradient, sparse matrix, many-
core, PCGA and linear system.
I. INTRODUCTION
Many important problems in applied science and
engineering, such as the Navier-Stokes equations in fluid
dynamics, the primitive equations in global climate
modeling, the strain-stress equations in mechanical and
materials engineering, and the neutron diffusion equation
in nuclear engineering, involve complicated systems of
partial differential equations (PDEs). When
approximated numerically on a discrete grid or mesh,
such problems produce large systems of algebraic linear
and non-linear equations, whose numerical solution
may be prohibitively expensive in terms of time and
storage. The discretization of linear and non-linear
PDEs produces large sparse matrices. Applications involving sparse matrices
can experience significant performance degradation on
general-purpose processors. The classic example is
sparse matrix-vector multiplication, which has a high
ratio of memory references to floating-point arithmetic
operations and suffers from irregular memory access
patterns. Further, for large n, the n-vector x cannot fit
in the processor's cache, so there
may be little chance for data reuse. Over the past 30
years, researchers have tried to mitigate the poor
performance of sparse matrix computations through
various approaches such as reordering the data to reduce
wasted memory bandwidth, modifying the algorithms to
reuse the data, and even building specialized memory
controllers.
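The "non-zeros only" storage the paper alludes to can be illustrated with the compressed sparse row (CSR) format, one common such scheme (the paper does not name its exact format, so CSR is an illustrative assumption). Note the indirect indexing into x, which is the source of the irregular memory access pattern described above:

```python
# Illustrative sketch (CSR is an assumed format, not necessarily the authors'):
# only the non-zero entries are stored, so a matrix-vector product touches
# each stored value once but reads x through an index array.

def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a CSR matrix A with len(row_ptr) - 1 rows."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        s = 0.0
        # non-zeros of row i occupy values[row_ptr[i]:row_ptr[i+1]]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]  # indirect (irregular) read of x
        y[i] = s
    return y

# 3x3 example: A = [[4, 0, 1], [0, 3, 0], [1, 0, 2]] stores only 5 entries
values  = [4.0, 1.0, 3.0, 1.0, 2.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(csr_matvec(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [5.0, 3.0, 3.0]
```

Because x is read through col_idx, consecutive iterations may touch widely separated cache lines, which is exactly the reuse problem described in the paragraph above.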
Despite these efforts, sparse matrix performance on
general-purpose processors still depends on the matrix's
sparsity structure. In recent times, however, the large
capacity, intrinsic parallelism and flexibility of multi-core
systems have prompted researchers to map computational
kernels onto them. In some instances, these kernels
achieve significant speedups over their software-only
counterparts running on general-purpose processors.
In this research project, our focus is to accelerate a sparse
linear system solver on a recent many-core processor
using the iterative conjugate gradient method, which
parallelizes well for solving sparse linear systems
effectively. However, parallel implementations of the
method applied in practice are not universal (suitable for
all physical problems), and they cannot be, because the
convergence rate of the method depends strongly on the
properties of the coefficient matrix. For the symmetric,
positive definite matrices used in this research
project there are many very good serial and parallel
implementations, but a problem arises when the matrix
is not symmetric or positive definite. A few
modifications of the method help to achieve
convergence in such cases, but the convergence rate
still depends strongly on the problem structure and is
usually much worse than in the unmodified versions.
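For reference, one iteration of the unpreconditioned conjugate gradient method for a symmetric positive definite system A x = b can be sketched as below. This is the generic textbook form, not the authors' parallel implementation; the only matrix operation it needs is the matrix-vector product, which is why a sparse "non-zeros only" storage scheme suffices:

```python
# Generic (serial, unpreconditioned) conjugate gradient sketch for SPD A.
# A_mul is any function computing A @ v, e.g. a sparse matrix-vector product.

def conjugate_gradient(A_mul, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                     # residual r = b - A x (x starts at 0)
    p = r[:]                     # initial search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = A_mul(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:         # converged: squared residual small enough
            break
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

# Small SPD example: A = [[4, 1], [1, 3]], b = [1, 2]; exact x = [1/11, 7/11]
x = conjugate_gradient(lambda v: [4.0 * v[0] + v[1], v[0] + 3.0 * v[1]],
                       [1.0, 2.0])
print(x)  # approximately [0.0909, 0.6364]
```

For an n-by-n SPD system, exact arithmetic guarantees convergence in at most n iterations; in floating point, and for non-symmetric or indefinite matrices, convergence behaves as discussed above.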
A parallel implementation of the conjugate gradient
method helps to shorten the computation time, and it
often even makes it possible to solve a particular, very
large problem that could not be solved on a single
machine because of its limitations (especially the
capacity of physical memory). One can create one's own
version that works optimally for the particular
problem structure and the actual hardware
environment, which can be especially useful in a
heterogeneous computer cluster. Taking into account the
differences in computational power between the
particular computers in a cluster, it can be important to
distribute the computational data appropriately. In the
version implemented by the authors, the data for
computations are distributed unevenly among all the
cores. Matrix transposition and matrix-matrix
multiplication operations are avoided (compared to the
previous version described in [1]). This allowed the
authors to apply a new way of partitioning the data, which
is much better in terms of storage consumption as
well as the number of computations. The coefficient
matrix is partitioned while the input data file is being
read, and the parts of the matrix are immediately sent to
the appropriate cores. This makes it possible to solve
much larger problems than in the previous version.
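An uneven distribution of rows among cores might, for example, weight each core's share by non-zero count rather than row count, so that cores holding denser rows receive fewer of them. The helper below is a hypothetical sketch of such a scheme (the paper does not specify its partitioning rule), not the authors' actual implementation:

```python
# Hypothetical sketch: assign each core a contiguous block of matrix rows,
# balanced by the number of non-zero entries rather than by row count.

def partition_rows(nnz_per_row, num_cores):
    """Return (start, end) row ranges, one per core, balanced by non-zeros."""
    total = sum(nnz_per_row)
    target = total / num_cores        # ideal non-zero count per core
    parts, start, acc = [], 0, 0
    for i, w in enumerate(nnz_per_row):
        acc += w
        # close the current block once it reaches the per-core target,
        # leaving at least one row range for each remaining core
        if acc >= target and len(parts) < num_cores - 1:
            parts.append((start, i + 1))
            start, acc = i + 1, 0
    parts.append((start, len(nnz_per_row)))
    return parts

# 6 rows with uneven density split over 3 cores: row counts differ per core,
# but each block carries a roughly equal share of the 16 non-zeros.
print(partition_rows([5, 1, 1, 1, 4, 4], 3))  # [(0, 2), (2, 5), (5, 6)]
```

A scheme of this kind also extends naturally to heterogeneous clusters: the per-core target could be scaled by each machine's relative computational power instead of being uniform.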
A. Related works
The parallel solution of linear systems of equations is a
well examined but still active field in high-performance
978-1-61284-908-9/11/$26.00 © 2011 IEEE
Proceedings of 14th International Conference on Computer and Information Technology (ICCIT 2011) 22-24 December, 2011, Dhaka, Bangladesh