Parallel Computation of Sparse Linear Systems on Many-Core Processors

Abu Sayed Md. Mostafizur Rahaman, Jesmin Akhter 1 and Mohammad Touhidur Rahman 2
Department of Computer Science and Engineering, Jahangirnagar University, Savar, Dhaka, Bangladesh
1 Institute of Information Technology, Jahangirnagar University, Savar, Dhaka, Bangladesh
2 Daffodil Institute of IT, Banani Branch, Dhaka, Bangladesh
chyonecs@yahoo.com, togorcse@yahoo.com and rahmantm@gmail.com

ABSTRACT

In this paper, the authors describe a parallel implementation of the conjugate gradient method on a many-core system, aimed specifically at solving sparse linear systems. This implementation differs from the one applied earlier [1] in that it uses a special scheme for storing sparse coefficient matrices: only non-zero elements are stored and taken into account during computations, so that the sparsity of the coefficient matrix is taken full advantage of. Finally, the speedup of the parallel algorithm is examined for coefficient matrices of different sparsity, arising from different physical problems.

Keywords: Conjugate gradient, sparse matrix, many-core, PCGA and linear system.

I. INTRODUCTION

Many important problems in applied science and engineering, such as the Navier-Stokes equations in fluid dynamics, the primitive equations in global climate modeling, the stress-strain equations in mechanical and materials engineering, and the neutron diffusion equation in nuclear engineering, involve complicated systems of partial differential equations (PDEs). When approximated numerically on a discrete grid or mesh, such problems produce large systems of algebraic linear and non-linear equations, whose numerical solution may be prohibitively expensive in terms of time and storage. The discretization of linear and non-linear PDEs produces large sparse matrices.
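As a concrete illustration of storing only the non-zero elements, the compressed sparse row (CSR) format is one common choice; the paper does not name its exact storage layout, so the following Python sketch uses CSR as a representative example, with a small matrix chosen purely for illustration.

```python
# Illustrative 4x4 sparse matrix in compressed sparse row (CSR) form.
# (CSR is one common way to keep only the non-zero elements; the
# paper's exact layout is not specified.)
# Dense form, with the zeros never stored:
#   [ 4  0  0  1 ]
#   [ 0  3  0  0 ]
#   [ 0  0  5  2 ]
#   [ 1  0  2  6 ]
values  = [4.0, 1.0, 3.0, 5.0, 2.0, 1.0, 2.0, 6.0]  # non-zeros, row by row
col_idx = [0,   3,   1,   2,   3,   0,   2,   3]     # column of each value
row_ptr = [0, 2, 3, 5, 8]  # row i spans values[row_ptr[i]:row_ptr[i+1]]

def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A*x touching only the stored non-zero elements."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

print(csr_matvec(values, col_idx, row_ptr, [1.0, 1.0, 1.0, 1.0]))
# With the all-ones vector this prints the row sums: [5.0, 3.0, 7.0, 9.0]
```

Note how the inner loop makes indirect, data-dependent accesses into x via col_idx; this irregular access pattern is exactly why sparse kernels are hard for general-purpose memory hierarchies.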
Applications involving sparse matrices can experience significant performance degradation on general-purpose processors. The classic example is sparse matrix-vector multiplication, which has a high ratio of memory references to floating-point arithmetic operations and suffers from irregular memory access patterns. Further, for large n the n-vector x cannot fit in a general-purpose processor's cache, so there may be little opportunity for data reuse. Over the past 30 years, researchers have tried to mitigate the poor performance of sparse matrix computations through various approaches, such as reordering the data to reduce wasted memory bandwidth, modifying the algorithms to reuse data, and even building specialized memory controllers. Despite these efforts, sparse matrix performance on general-purpose processors still depends on the matrix's sparsity structure. Recently, however, the large capacity, intrinsic parallelism and flexibility of many-core processors have prompted researchers to map computational kernels onto such systems. In some instances, these kernels achieve significant speedups over their software-only counterparts running on general-purpose processors. In this research project, our focus is to accelerate a sparse linear system solver on a modern many-core processor using the iterative conjugate gradient method, which lends itself well to parallel solution of sparse linear systems. However, the parallel implementations of the method applied in practice are not universal (suitable for all physical problems), and they cannot be, because the convergence rate of the method depends strongly on the properties of the coefficient matrix. For the symmetric, positive definite matrices used in this research project there are many very good serial and parallel implementations, but a problem arises when the matrix is not symmetric or positive definite.
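For reference, the unpreconditioned conjugate gradient iteration for a symmetric positive definite system Ax = b can be sketched as follows. This is the textbook serial form, not the authors' parallel implementation; the only operation that touches the matrix is a matrix-vector product, which is why the sparse storage scheme and the parallel matvec dominate the method's performance.

```python
def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A.
    `matvec(p)` returns A*p; A itself is never formed here, so any
    sparse storage scheme can supply the product."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                      # residual r = b - A*x (x starts at 0)
    p = list(r)                      # first search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new ** 0.5 < tol:      # residual norm small enough: converged
            break
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

# Tiny SPD example (illustrative values): A = [[4, 1], [1, 3]], b = [1, 2]
A = [[4.0, 1.0], [1.0, 3.0]]
mv = lambda p: [sum(aij * pj for aij, pj in zip(row, p)) for row in A]
x = conjugate_gradient(mv, [1.0, 2.0])
```

In a parallel version such as the one described in this paper, the matvec, the two dot products, and the vector updates in each iteration are the operations distributed across cores.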
There are a few modifications of the method that help to achieve convergence in such cases, but the convergence rate still depends strongly on the problem structure and is usually much worse than in the unmodified versions. A parallel implementation of the conjugate gradient method shortens the computation time, and it often even makes it possible to solve a particular, very large problem that could not be solved on a single machine because of its limitations (especially the capacity of physical memory). One can create a custom version that works optimally for the particular problem structure and in the actual hardware environment; this can be especially useful in a heterogeneous computer cluster. Taking into account the differences in computational power between the particular computers in a cluster, it can be important to distribute the computational data appropriately. In the version implemented by the authors, the data for computations are distributed unevenly among all the cores. Matrix transposition and matrix-matrix multiplication operations are avoided (compared with the previous version described in [1]). This allowed the authors to apply a new way of data partitioning, which is much better with respect to storage consumption as well as the number of computations. The coefficient matrix is partitioned while the input data file is being read, and the parts of the matrix are immediately sent to the appropriate cores. This makes it possible to solve much larger problems than in the previous version.

A. Related works

The parallel solution of linear systems of equations is a well examined but still active field in high-performance

987-161284-908-9/11/$26.00 © 2011 IEEE
Proceedings of 14th International Conference on Computer and Information Technology (ICCIT 2011), 22-24 December, 2011, Dhaka, Bangladesh