Implementing Multithreaded Programs Using CUDA for GPGPU to Solve Matrix Multiplication

Mehdi G. Duaimi 1, Abbas F. J. AL-Gburi 2, Ehsan A. Al-Zubaidi 3, Ibraheem Al-Jadir 4

1 Dept. of Computer Science, College of Sciences, University of Baghdad, Baghdad, Iraq.
2 Iraqi Ministry of Finance, Economic Dept., Information Systems Dept., Baghdad, Iraq.
3 Computer Center, Faculty of Physical Planning, University of Kufa, Najaf, Iraq.
4 School of Engineering and Information Technology, Murdoch University, Perth, Australia.

Abstract

A significant problem in linear algebra is the performance variation observed across similarly sized datasets in matrix-vector multiplication. A new storage format based on a two-dimensional blocking technique, called blocked row-column (BRC), can successfully cope with this variety of challenges. The central aim of the present paper is to design and implement a multithreaded programming algorithm using CUDA for GPGPU and to analyze the performance of the CUDA program. The paper also compares the performance of the OpenMP program with previous work. The algorithm is designed using the CUDA libraries to perform matrix multiplication on the GPU; using these libraries and their functions, we optimize performance by using the maximum GPU block size available for the computation. The results show that the 2D-array CUDA design uses more memory than the 1D-array design. A CUDA 1D array (a 2D array flattened to 1D) is therefore the preferred format for parallel matrix multiplication, given the critical factors of high speed and minimal memory consumption. The paper concludes that the GPU (CUDA) version of matrix multiplication performs better than the OpenMP parallel version.

1. INTRODUCTION

In its effort to determine which method should be regarded as the most efficient for matrix processing, NVIDIA has offered a critical look at different performance techniques, including tiling, memory coalescing, prefetching, and loop unrolling. The core difficulty of matrix operations lies in the size n of the matrices involved. In GPU implementations of matrix multiplication, each thread is assigned a subset of a row of A and a column of B. Regarding global memory, a faster class of memory is used to alleviate long-latency accesses and bandwidth problems. Nevertheless, sparse matrix-vector multiplication (SpMV) is of singular importance in sparse linear algebra. Highly tuned SpMV methods target efficiency-oriented architectures such as GPUs, which use several common scatter classes; the technique is well suited to utilizing a large fraction of peak bandwidth efficiently and successfully [2]. The methods of [3] focus on the parallel computation of sparse matrix-vector products (SpMV), using both CPU and GPU microarchitectures across a wide range of general-purpose matrix datasets. However, the performance of these methods is inconsistent, often varying by an order of magnitude across similarly sized datasets. A new storage format with a two-dimensional blocking technique, called blocked row-column (BRC), can efficiently cope with these challenges; the program is unified by reordering and grouping the input matrix according to whether its elements are zero or not [4]. In addition, the massive parallelism of graphics cards (GPUs) can accelerate such calculations. For this purpose, a platform can be combined with NVIDIA's Compute Unified Device Architecture (CUDA).
An example is the parallelized calculation of matrix-vector multiplication of arbitrary size [6]. This relates to the limitation that graph-partitioning techniques can only handle square, symmetric matrices; hypergraph-partitioning methods overcome this inadequacy of the graph-partitioning technique.

2. RELATED WORKS

Matrix multiplication is one of the most frequently used and, at the same time, slowest mathematical operations. Its time complexity is approximately cubic, even if we use some improvements such as Strassen's algorithm [4]-[5]. There are several reasons to take this operation as a test case: a) its behavior at various input sizes can be predicted because it comprises simple operations; b) its basic implementation is very simple and widely known; and c) all of its sub-operations have suitable characteristics (commutativity and distributivity of addition and multiplication).

Journal of Xi'an University of Architecture & Technology, Volume XII, Issue III, 2020, ISSN No: 1006-7930, Page No: 3083