Proceedings of the International Multiconference on Computer Science and Information Technology pp. 285–291 ISBN 978-83-60810-14-9 ISSN 1896-7094

Testing Tesla Architecture for Scientific Computing: the Performance of Matrix-Vector Product

Pawel Maciol, Cracow University of Technology, Institute of Computer Modelling, ul. Warszawska 24, 31-155 Kraków, Poland. Email: pmaciol@pk.edu.pl

Krzysztof Banaś, AGH University of Science and Technology, Department of Applied Computer Science and Modelling, al. Mickiewicza 30, 30-059 Kraków, Poland; Cracow University of Technology, Institute of Computer Modelling, ul. Warszawska 24, 31-155 Kraków, Poland. Email: pobanas@cyf-kr.edu.pl

Abstract—The paper presents the results of several experiments evaluating the performance, in matrix-vector multiplication, of NVIDIA processors implementing the new Tesla architecture. Three matrix forms, dense, banded and sparse, are considered together with three hardware platforms: the NVIDIA Tesla C870 computing board, the NVIDIA GeForce 8800 GTX graphics card and one of the newest Intel Xeon processors, the E5462, with a 1.6 GHz front side bus. The conclusions from the experiments indicate what speed-ups can be expected when accelerators in the form of the presented GPUs, instead of standard CPUs, are used for the considered computational kernels.

I. MOTIVATION

THE USE of graphics processing units (GPUs) in scientific computing is becoming an accepted alternative to calculations employing traditional CPUs [1]. The characteristics of GPUs, especially their parallel execution capabilities and fast memory access, render them attractive in many application areas. One of the most important application domains is numerical linear algebra. The computational kernels from linear algebra are used in many scientific codes. Hence the widespread interest in porting such kernels to GPUs and testing them there [2].
The purpose of the present article is to assess the performance of recent NVIDIA GPUs on one of the linear algebra kernels, namely the matrix-vector product. This kernel plays an important role in the implementation of iterative solvers for systems of linear equations. Moreover, it is a typical memory-bound operation: its performance depends mainly on the speed of communication between the processor and the memory chips, and much less on the processing capabilities of the processor itself.

The organization of the paper is the following. In the next section the performance characteristics of NVIDIA GPUs are described and compared to the characteristics of typical contemporary CPUs. Section III presents the matrix formats considered in the paper and the corresponding matrix-vector multiplication algorithms. In Section IV the set-up of the experiments as well as the tests' results are described. Finally, conclusions are drawn in Section V.

II. PERFORMANCE CHARACTERISTICS OF CPUS AND GPUS

A typical contemporary processor is a two- or four-core unit equipped with a memory hierarchy comprised of several layers of cache and the main memory. From the point of view of performance for scientific codes, two parameters are of premium importance: the processing speed and the speed of data transfer from the memory. The processing speed depends on the number of cores, the clock frequency and the number of instructions completed in every clock cycle. This last number varies greatly depending on the application. The theoretical maximum is usually two to four instructions per cycle. The practical performance can be close to the maximum whenever "the memory wall" is not hit, i.e. the processor gets all the necessary data on time. This is the case for BLAS Level 3 routines, e.g. the matrix-matrix product, on which the direct solution of systems of linear equations is usually based. The situation is different for memory-bound algorithms.
If the processor cannot get the necessary data on time, the performance can drop to a small percentage of the maximum.

Graphics processing units differ significantly from general purpose CPUs in both aspects affecting performance. Their processing speed is much greater due to the large number of specialised cores (though these usually operate at lower frequencies than CPU cores). The throughput to the memory is also greater for GPUs, due to buses that are usually wider than those of CPUs. Hence both processing-speed-limited and memory-limited algorithms can benefit from off-loading to GPUs. One of the serious drawbacks of contemporary GPUs is the use of single precision floating point numbers only. However, all major producers of GPUs aiming at general purpose computing are today starting to offer double precision floating point capabilities in their products, so this limitation should shortly be overcome.

A. An example CPU

Let us take, as an example CPU, one of the newest Intel Quad-core Xeon processors, the E5462 with four cores, 2.8 GHz