A New Approach for Accelerating the Sparse Matrix-Vector Multiplication

Ivan Šimeček, xsimecek@fel.cvut.cz
Department of Computer Science, Faculty of Electrical Engineering, Czech Technical University, Technická 2, 166 27 Prague 6, Czech Republic

Sparse matrix-vector multiplication (SpMV for short) is one of the most common subroutines in numerical linear algebra. The problem is that the memory access patterns during SpMV are irregular, so cache utilization can suffer from low spatial or temporal locality. Existing approaches to improving SpMV performance are based on matrix reordering and register blocking. These matrix transformations are designed to handle randomly occurring dense blocks in a sparse matrix, and their efficiency depends strongly on the presence of suitable blocks. The overhead of reorganizing a matrix from one format to another is often on the order of tens of SpMV executions. For that reason, such a reorganization pays off only if the same matrix A is multiplied by many different vectors, e.g., in iterative linear solvers. This paper introduces a new approach to accelerating SpMV. It consists of three steps: 1) dividing matrix A into non-empty regions, 2) choosing an efficient way to traverse these regions (in other words, choosing an efficient ordering of the partial multiplications), and 3) choosing the optimal type of storage for each region. All three steps are tightly coupled. The first step divides the whole matrix into smaller parts (regions) that can fit in the cache. The second step improves locality during the multiplication through better handling of distant references. The last step maximizes the machine computation performance of the partial multiplication for each region. In this paper, we describe these three steps in more detail, including fast and inexpensive algorithms for all of them.
Our measurements show that our approach gives a significant speedup for almost all matrices arising from various technical areas.

In general, there is only one way to improve locality in SpMV [1,2]: making the multiplication more cache-sensitive. In a typical numerical algorithm, the cache hit ratio is high if the cache can hold the data accessed in the innermost loop. Memory accesses that do not satisfy this assumption are called distant references. In this paper, we propose and explore a new method for increasing the locality of SpMV, consisting of three steps:

1. Dividing matrix A into disjoint non-empty rectangular regions.
2. Choosing a suitable storage format for the regions.
3. Choosing a good traversal of these regions during SpMV.

We divide matrix A into disjoint non-empty rectangular regions so that the cache can hold all data accessed during the partial SpMV within each region. For simplicity, we assume in this paper that regions are square submatrices of the same size r containing at least one nonzero element. The optimal value of r is hard to predict, because it depends on the locations of the nonzero elements and on the cache sizes. Regions are viewed as submatrices, so we can use common sparse matrix formats for storing the nonzero elements inside them. Since every element lies inside a region, all coordinates can be expressed relative to the region, and they can therefore be stored using smaller data types (char or byte) instead of