Performance Drawbacks for Matrix Multiplication using Set Associative Cache in GPU devices

Leonid Djinevski, Sime Arsenovski
FON University
1000 Skopje, Macedonia
Email: leonid.djinevski, sime.aresnovski@fon.edu.mk

Sasko Ristov and Marjan Gusev
Ss. Cyril and Methodius University
1000 Skopje, Macedonia
Email: sashko.ristov, marjan.gushev@finki.ukim.mk

Abstract—Shared memory processors show negative performance impulses (drawbacks) in certain regions when executing the basic matrix multiplication algorithm. In this paper we continue the analysis of the GPU memory hierarchy and the corresponding cache memory organization. We give a theoretical analysis of why a negative performance impulse appears for specific problem sizes. The main reason is the cache storage organization: the negative performance peak is caused by the mapping of matrix elements onto a single cache set instead of across the whole cache. The obtained experimental results confirm our theoretical analysis. We also propose a method to avoid the situations where these performance drawbacks appear.

Index Terms—Cache Memory, SIMD, GPGPU.

I. INTRODUCTION

Matrix multiplication is a widely used algorithm in many computing applications. Its execution time directly depends on the cache memory architecture and organization. The performance depends on the following cache parameters: cache size, replacement policy, number of cache levels, cache-line size, cache inclusivity, cache associativity, etc. Many hardware architectures exist for faster execution of the matrix multiplication algorithm, such as supercomputers, grids, data-flow computing, cloud computing, and GPUs. A deeper understanding of their organization can lead to significant performance improvements.

GPU devices have recently provided massive acceleration and, together with their low cost, have brought significant processing power into regular PCs. Their architecture is intended to maximize throughput without regard to the latency of any individual thread.
GPUs are more appropriate for applications with regular data access patterns than for more complicated scatter/gather access patterns [1]. Three main cache organizations exist: direct mapped, fully associative, and set associative. In this paper we focus on the GPU architecture with its set associative cache memory storage pattern, in order to understand the performance drawbacks for particular matrix sizes and to improve the matrix multiplication algorithm.

The rest of the paper is organized as follows. In Section II we give an overview of related work in the area of the research problem. Section III briefly presents the GPU memory architecture. A theoretical analysis of possible performance drawbacks is presented in Section IV, followed by a description of the methodology used in the experiments in Section V. The results of the experiments are elaborated in Section VI. Finally, we conclude our work with recommendations in Section VII.

II. RELATED WORK

The matrix element storage pattern has a strong impact on matrix multiplication performance on the GPU [2]. Likun and Dingfang [3] propose a mechanism to avoid cache races and cache splits in order to improve GPU GFlops. Tang et al. [4] optimize cache locality on the GPU by modeling cache miss analysis; however, they ignore the data reuse among concurrent thread blocks on the SM. Volkov and Demmel [5] present detailed benchmarks of the GPU memory system, kernel start-up costs, and arithmetic throughput on dense matrix operations.

Cache set associativity can cause huge performance drawbacks in matrix multiplication for particular matrix sizes. Reactive mechanisms (selective displacement and feedback) [6] and way prediction [7] can improve set-associative cache access times. Greater set associativity reduces the number of cache misses, but does not necessarily improve overall performance, since it increases the cache hit access time.
Padding the first element of the second matrix amortizes the performance drawback due to cache associativity [8]. Hongil [9] dynamically selects an optimized replacement policy for each cache set via a workload speculation mechanism to improve cache performance.

Gusev and Ristov [10] proved both theoretically and experimentally that the CPU cache memory storage pattern can significantly reduce matrix multiplication performance by increasing the number of last level cache misses due to the use of a set associative cache. Using their theorems, one can determine the matrix sizes for which the maximum cache performance drawback in the matrix multiplication algorithm will be generated due to the matrix storage pattern in an n-way associative memory. In this paper we have used these theorems and experimentally proved that they also hold for the set associative caches in GPU architectures.

The latest GPUs also have a two level cache hierarchy organized with set associativity. Matsumoto et al. [11] determine huge performance drawbacks of DGEMM for matrix sizes that are multiples of 1024, without a deeper explanation. Two problems are exposed by the usage of caches: the cache capacity problem refers to a lack of resources, while the cache associativity problem refers to inefficient usage of the cache. In this paper we focus on the performance analysis of the cache associativity problem.

MIPRO 2013/DC-VIS 213