A Highly Parallel SAD Architecture for Motion Estimation in HEVC Encoder

Ahmed Medhat, Ahmed Shalaby, Mohammed S. Sayed, Maha Elsabrouty
Egypt-Japan University of Science and Technology
P.O. Box 179, New Borg El-Arab City, Alexandria 21934, Egypt
{ahmed.abdelsalam, ahmed.shalaby, mohammed.sayed, maha.elsabrouty}@ejust.edu.eg

Farhad Mehdipour
E-JUST Center, Kyushu University
3-8-33 Momochihama, Sawara-ku, Fukuoka 814-0001, Japan
farhad@ejust.kyushu-u.ac.jp

Abstract—The high computational cost of the motion estimation module in the new HEVC standard raises the need for efficient hardware architectures that can meet the real-time processing constraint. In addition, targeting HD and UHD resolutions increases the motion estimation processing cost beyond the capabilities of existing architectures. This paper presents a highly parallel sum of absolute difference (SAD) architecture for motion estimation in the HEVC encoder. The proposed architecture has 64 PUs operating in parallel to calculate the SAD values of the prediction blocks. It processes block sizes from 4x4 up to 64x64. The proposed architecture has been prototyped, simulated and synthesized on a Xilinx Virtex-7 XC7VX550T FPGA. At a clock frequency of 458 MHz, the proposed architecture processes 2K-resolution video at 30 fps with a ±20-pixel search range. The prototyped architecture utilizes 7% of the LUTs and 5% of the slice registers of the Xilinx Virtex-7 XC7VX550T FPGA.

Keywords—HEVC, inter prediction, SAD architecture, variable block size motion estimation (VBSME)

I. INTRODUCTION

Recent estimates indicate that more than 50% of current network traffic is compressed real-time video, and this share is expected to rise to 90% within a few years [1]. In addition, the growing popularity of high-definition (HD) and beyond-HD video is creating a stronger need for better video compression efficiency.
These trends raise the demand for a new video coding standard with higher compression efficiency than the currently used H.264/MPEG-4 AVC standard. The new High Efficiency Video Coding (HEVC) standard was introduced with the goal of doubling compression efficiency. It can achieve a 50% bit-rate saving compared to H.264/MPEG-4 AVC at the same video quality [2]-[3].

The quad-tree structure is the fundamental feature that differentiates HEVC from H.264/MPEG-4 AVC. As shown in Fig. 1, HEVC is based on the coding tree unit (CTU) instead of the macroblock used in H.264/MPEG-4 AVC. Unlike the traditional macroblock, the CTU has a variable size, which is selected by the encoder and can be larger or smaller than a traditional macroblock [2]. Each CTU is partitioned into one or more coding units (CUs), and each CU has an associated partitioning into prediction units (PUs) and a tree of transform units (TUs) [4]. Each PU is coded using either inter or intra prediction.

Motion estimation (ME) and motion compensation are the major computational loads in a video encoder, consuming more than 90% of the encoding time [5]. Although HEVC provides a simple inter prediction process, the overhead involved is larger than in H.264/MPEG-4 AVC, which consequently increases the complexity of the HEVC encoder [6]. As in H.264/MPEG-4 AVC, for every prediction block (PB) in HEVC, a block matching algorithm (BMA) finds the best matching block within a certain search window.

In the last decade, many hardware architectures for ME have been proposed. Nevertheless, most of these architectures target H.264/MPEG-4 AVC rather than HEVC. In this paper, we propose a high-performance hardware architecture, in terms of parallelism and computational complexity, for the sum of absolute difference (SAD) unit in HEVC. The proposed architecture has been implemented on an FPGA and compared with other SAD unit architectures for HEVC.
Synthesis results show that the proposed architecture processes video data at a higher rate than existing ones. Moreover, it can meet the requirements of real-time coding of 2K-resolution video at 30 frames per second (fps).

The rest of the paper is organized as follows. HEVC ME is explained in Section II. In Section III, the related work is reviewed. The proposed SAD unit architecture is described in detail in Section IV. Section V presents the simulation results and discussion. Finally, Section VI concludes the paper.

Fig. 1. HEVC quad-tree structure and subdivision of CTU into CUs
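
As a software point of reference for the block matching described in the introduction, the following minimal Python sketch computes the SAD between a current block and a candidate block, then performs an exhaustive search over a ±search_range window. It only illustrates the cost function that the hardware evaluates; it is not the proposed architecture. The function names `sad` and `full_search`, the frame layout (lists of rows of integer luma samples), and the rule of skipping candidates that fall outside the reference frame are assumptions made for this example.

```python
def sad(cur, ref, bx, by, rx, ry, size):
    """Sum of absolute differences between the size x size block of `cur`
    whose top-left corner is (bx, by) and the block of `ref` at (rx, ry)."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(cur[by + dy][bx + dx] - ref[ry + dy][rx + dx])
    return total


def full_search(cur, ref, bx, by, size, search_range):
    """Exhaustive block matching: return (best_sad, mv_x, mv_y) over all
    displacements in [-search_range, +search_range] in both directions.
    Candidates that fall outside the reference frame are skipped
    (an assumption of this sketch)."""
    height, width = len(ref), len(ref[0])
    best = None
    for my in range(-search_range, search_range + 1):
        for mx in range(-search_range, search_range + 1):
            rx, ry = bx + mx, by + my
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, bx, by, rx, ry, size)
            # Strict comparison keeps the first candidate on ties.
            if best is None or cost < best[0]:
                best = (cost, mx, my)
    return best
```

For a ±20-pixel search range, this loop evaluates up to 41 x 41 = 1681 candidate positions per block, which suggests why a parallel SAD datapath is attractive for real-time encoding.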