SIM 2011 – 26th South Symposium on Microelectronics

Multiprocessing Acceleration of H.264/AVC Motion Estimation Full Search Algorithm under CUDA Architecture

Eduarda R. Monteiro, Bruno B. Vizzotto, Cláudio M. Diniz, Bruno Zatt, Sergio Bampi
{ermonteiro, bbvizzotto, cmdiniz, bzatt, bampi}@inf.ufrgs.br
Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, Brazil

Abstract

This work presents a parallel GPU-based solution for the Motion Estimation (ME) process in a video encoding system. We propose a partitioning of the Full Search block matching algorithm onto the CUDA architecture and compare its performance with a theoretical model and with two reference implementations (sequential, and parallel using the OpenMP library). We obtained an O(n²/log²n) speed-up, which fits the theoretical model for different search area sizes. This represents a gain of up to 600x over the serial implementation and 66x over the parallel OpenMP implementation.

1. Introduction

In the past decade, the demand for high-quality digital video applications has drawn the attention of industry and academia, driving the development of advanced video coding techniques and standards. This effort resulted in the publication of H.264/AVC [1], the state-of-the-art video coding standard, which provides higher coding efficiency than previous standards (MPEG-2, MPEG-4, H.263). This higher coding efficiency, however, comes at the cost of increased computational complexity in both encoder and decoder. In this scenario, Motion Estimation (ME) [2] is a key step for achieving high compression gains. It exploits the temporal redundancy of a video sequence by searching the reference frames for the region most similar to each block of the current frame. When the best match is found, a motion vector is calculated. ME provides most of the compression gains of H.264/AVC.
The choice of the ME algorithm and of the associated similarity criterion used to determine the best match is an important encoder design decision not defined by the H.264/AVC standard. Block matching requires intensive computation and memory communication, representing about 80% of the total computational complexity of current video coders [3]. However, some block matching algorithms for ME have great parallelization potential. Full Search (FS) [1] is one of them: it performs the search for the best match exhaustively inside a search area. The best match is found by computing, for each block position inside the search area, a similarity criterion such as the Sum of Absolute Differences (SAD). Note that the SAD of one candidate block does not depend on the SAD of any other candidate, so these computations can be carried out simultaneously, in parallel.

By exploiting the inherent parallelism of the FS algorithm and the huge computational capacity of recent Graphics Processing Units (GPUs), this work presents a parallel GPU-based solution for the FS block matching algorithm implemented on the Compute Unified Device Architecture (CUDA), from NVIDIA [4]. We show how we efficiently mapped the FS algorithm to the CUDA programming model. The results obtained by running the FS implementation on real video sequences were compared with a serial implementation and with a parallel OpenMP implementation, as well as with a theoretical complexity model in terms of computation and communication.

2. Motion Estimation

The main goal of ME is to find, in previously reconstructed frames (reference frames), the block that most closely resembles each block of the current frame, thus reducing the temporal redundancy between the frames to be transmitted. The displacements are mapped to motion vectors and an associated residue.
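As a concrete reference for the similarity criterion mentioned above, the SAD computation can be sketched in C as follows. This is a minimal illustration; the function and parameter names are our own assumptions, not taken from the paper's implementation:

```c
/* Sum of Absolute Differences (SAD) between a block of the current frame
 * and a candidate block of the reference frame. Frames are stored
 * row-major with `stride` pixels per row. Illustrative sketch only. */
static int sad(const unsigned char *cur, const unsigned char *ref,
               int stride, int block_size)
{
    int sum = 0;
    for (int y = 0; y < block_size; y++)
        for (int x = 0; x < block_size; x++) {
            int d = cur[y * stride + x] - ref[y * stride + x];
            sum += (d < 0) ? -d : d;   /* accumulate |cur - ref| */
        }
    return sum;
}
```

Because each candidate's SAD is independent of every other candidate's, calls like this can be issued for all block positions at once, which is exactly the property the CUDA mapping exploits.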
For each block of the current frame, a motion vector is generated pointing to the position, in one or more reference frames, of the block with the highest similarity. The search for the best match is carried out exhaustively within a search area, using a search algorithm and a similarity criterion. The search area is a region of the reference frame built around the co-located position of the block to be coded in the current frame. At the end of the search, the optimum block, i.e. the most similar block according to the similarity criterion, is located, and a motion vector is generated indicating the position of this block in the reference picture. This process is illustrated in Fig. 1.
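The exhaustive search described in this section can be sketched in C as below. This is a simplified model under assumed conventions (a square ±`range` search window, a small block size; real H.264/AVC encoders use 16x16 macroblocks and larger ranges), and all names are illustrative:

```c
#include <limits.h>

/* SAD between two blocks (row-major frames, `stride` pixels per row). */
static int block_sad(const unsigned char *cur, const unsigned char *ref,
                     int stride, int block_size)
{
    int sum = 0;
    for (int y = 0; y < block_size; y++)
        for (int x = 0; x < block_size; x++) {
            int d = cur[y * stride + x] - ref[y * stride + x];
            sum += (d < 0) ? -d : d;
        }
    return sum;
}

/* Full Search: test every candidate displacement within +/- `range`
 * pixels of the co-located position (cx, cy) and keep the candidate
 * with the lowest SAD. Returns the best motion vector and its SAD. */
typedef struct { int dx, dy, sad; } mv_t;

static mv_t full_search(const unsigned char *cur, const unsigned char *ref,
                        int width, int height, int cx, int cy,
                        int block_size, int range)
{
    mv_t best = { 0, 0, INT_MAX };
    for (int dy = -range; dy <= range; dy++)
        for (int dx = -range; dx <= range; dx++) {
            int rx = cx + dx, ry = cy + dy;
            /* discard candidates lying outside the reference frame */
            if (rx < 0 || ry < 0 ||
                rx + block_size > width || ry + block_size > height)
                continue;
            int s = block_sad(cur + cy * width + cx,
                              ref + ry * width + rx, width, block_size);
            if (s < best.sad) {
                best.dx = dx; best.dy = dy; best.sad = s;
            }
        }
    return best;
}
```

In a GPU mapping, each candidate displacement (dx, dy) is a natural unit of parallel work, since the loop nest carries no dependence between candidates; only the final minimum selection requires a reduction.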