Parallel Computation of Sparse Linear Systems on Many-Core Processors

Abu Sayed Md. Mostafizur Rahaman, Jesmin Akhter 1 and Mohammad Touhidur Rahman 2
Department of Computer Science and Engineering, Jahangirnagar University, Savar, Dhaka, Bangladesh
1 Institute of Information Technology, Jahangirnagar University, Savar, Dhaka, Bangladesh
2 Daffodil Institute of IT, Banani Branch, Dhaka, Bangladesh
chyonecs@yahoo.com, togorcse@yahoo.com and rahmantm@gmail.com

ABSTRACT

In this paper, the authors describe a parallel implementation of the conjugate gradient method on a many-core system, aimed specifically at solving sparse linear systems. This implementation differs from the one applied earlier [1] in that it uses a special scheme for storing sparse coefficient matrices: only non-zero elements are stored and taken into account during computations, so that the sparsity of the coefficient matrix is taken full advantage of. Finally, the speedup of the parallel algorithm is examined for coefficient matrices of different sparsity, arising from different physical problems.

Keywords: Conjugate gradient, sparse matrix, many-core, PCGA and linear system.

I. INTRODUCTION

Many important problems in applied science and engineering, such as the Navier-Stokes equations in fluid dynamics, the primitive equations in global climate modeling, the stress-strain equations in mechanical and materials engineering, and the neutron diffusion equation in nuclear engineering, involve complicated systems of partial differential equations (PDEs). When approximated numerically on a discrete grid or mesh, such problems produce large systems of algebraic linear and non-linear equations, whose numerical solution may be prohibitively expensive in terms of time and storage. The discretization of linear and non-linear PDEs produces large sparse matrices.
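As a concrete illustration of storing only the non-zero elements, the compressed sparse row (CSR) format is one common choice; the paper does not name its exact storage layout, so the following Python sketch uses CSR as a representative example, with a small matrix chosen purely for illustration.

```python
# Illustrative 4x4 sparse matrix in compressed sparse row (CSR) form.
# (CSR is one common way to keep only the non-zero elements; the
# paper's exact layout is not specified.)
# Dense form, with the zeros never stored:
#   [ 4  0  0  1 ]
#   [ 0  3  0  0 ]
#   [ 0  0  5  2 ]
#   [ 1  0  2  6 ]
values  = [4.0, 1.0, 3.0, 5.0, 2.0, 1.0, 2.0, 6.0]  # non-zeros, row by row
col_idx = [0,   3,   1,   2,   3,   0,   2,   3]     # column of each value
row_ptr = [0, 2, 3, 5, 8]  # row i spans values[row_ptr[i]:row_ptr[i+1]]

def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A*x touching only the stored non-zero elements."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

print(csr_matvec(values, col_idx, row_ptr, [1.0, 1.0, 1.0, 1.0]))
# With the all-ones vector this prints the row sums: [5.0, 3.0, 7.0, 9.0]
```

Note how the inner loop makes indirect, data-dependent accesses into x via col_idx; this irregular access pattern is exactly why sparse kernels are hard for general-purpose memory hierarchies.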
Applications involving sparse matrices can experience significant performance degradation on general-purpose processors. The classic example is sparse matrix-vector multiplication, which has a high ratio of memory references to floating-point arithmetic operations and suffers from irregular memory access patterns. Further, for large n the n-vector x cannot fit in a general-purpose processor's cache, so there may be little opportunity for data reuse. Over the past 30 years, researchers have tried to mitigate the poor performance of sparse matrix computations through various approaches, such as reordering the data to reduce wasted memory bandwidth, modifying the algorithms to reuse data, and even building specialized memory controllers. Despite these efforts, sparse matrix performance on general-purpose processors still depends on the matrix's sparsity structure. Recently, however, the large capacity, intrinsic parallelism and flexibility of many-core processors have prompted researchers to map computational kernels onto such systems. In some instances, these kernels achieve significant speedups over their software-only counterparts running on general-purpose processors. In this research project, our focus is to accelerate a sparse linear system solver on a modern many-core processor using the iterative conjugate gradient method, which lends itself well to parallel solution of sparse linear systems. However, the parallel implementations of the method applied in practice are not universal (suitable for all physical problems), and they cannot be, because the convergence rate of the method depends strongly on the properties of the coefficient matrix. For the symmetric, positive definite matrices used in this research project there are many very good serial and parallel implementations, but a problem arises when the matrix is not symmetric or positive definite.
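For reference, the unpreconditioned conjugate gradient iteration for a symmetric positive definite system Ax = b can be sketched as follows. This is the textbook serial form, not the authors' parallel implementation; the only operation that touches the matrix is a matrix-vector product, which is why the sparse storage scheme and the parallel matvec dominate the method's performance.

```python
def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A.
    `matvec(p)` returns A*p; A itself is never formed here, so any
    sparse storage scheme can supply the product."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                      # residual r = b - A*x (x starts at 0)
    p = list(r)                      # first search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new ** 0.5 < tol:      # residual norm small enough: converged
            break
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

# Tiny SPD example (illustrative values): A = [[4, 1], [1, 3]], b = [1, 2]
A = [[4.0, 1.0], [1.0, 3.0]]
mv = lambda p: [sum(aij * pj for aij, pj in zip(row, p)) for row in A]
x = conjugate_gradient(mv, [1.0, 2.0])
```

In a parallel version such as the one described in this paper, the matvec, the two dot products, and the vector updates in each iteration are the operations distributed across cores.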
There are a few modifications of the method that help to achieve convergence in such cases, but the convergence rate still depends strongly on the problem structure and is usually much worse than in the unmodified versions. A parallel implementation of the conjugate gradient method shortens the computation time, and it often even makes it possible to solve a particular, very large problem that could not be solved on a single machine because of its limitations (especially the capacity of physical memory). One can create a custom version that works optimally for the particular problem structure and in the actual hardware environment; this can be especially useful in a heterogeneous computer cluster. Taking into account the differences in computational power between the particular computers in a cluster, it can be important to distribute the computational data appropriately. In the version implemented by the authors, the data for computations are distributed unevenly among all the cores. Matrix transposition and matrix-matrix multiplication operations are avoided (compared with the previous version described in [1]). This allowed the authors to apply a new way of data partitioning, which is much better with respect to storage consumption as well as the number of computations. The coefficient matrix is partitioned while the input data file is being read, and the parts of the matrix are immediately sent to the appropriate cores. This makes it possible to solve much larger problems than in the previous version.

A. Related works

The parallel solution of linear systems of equations is a well examined but still active field in high-performance

987-161284-908-9/11/$26.00 © 2011 IEEE
Proceedings of 14th International Conference on Computer and Information Technology (ICCIT 2011), 22-24 December, 2011, Dhaka, Bangladesh