HIGH-THROUGHPUT INTERPOLATION HARDWARE ARCHITECTURE WITH COARSE-GRAINED RECONFIGURABLE DATAPATHS FOR HEVC

Cláudio Machado Diniz¹,², Muhammad Shafique¹, Sergio Bampi², Jörg Henkel¹

¹ Chair for Embedded Systems (CES), Karlsruhe Institute of Technology (KIT), Germany
² Institute of Informatics, PPGC, Federal University of Rio Grande do Sul (UFRGS), Brazil
{cmdiniz, bampi}@inf.ufrgs.br; {muhammad.shafique, henkel}@kit.edu, diniz@ira.uka.de

ABSTRACT

Fractional-pel interpolation for motion estimation and motion compensation is one of the key computational hotspots in the new High Efficiency Video Coding (HEVC) standard. This work presents a high-throughput interpolation hardware architecture that improves the performance of HEVC encoding and decoding. It employs two acceleration engines, one for luma and one for chroma filtering, each with 12-pel-parallel coarse-grained reconfigurable interpolation datapaths. An adaptive scheduling scheme operates these interpolation datapaths differently depending on the prediction unit (PU) size and the execution scenario (i.e., motion estimation or motion compensation). We have implemented our hardware architecture in 150 nm technology. Compared to the state of the art [12], our architecture requires 49% less hardware area while processing QFHD (3840x2160) video at 30 fps.

Index Terms: HEVC, Interpolation Filter, Motion Estimation (ME), Motion Compensation (MC), Hardware Acceleration, Reconfigurable Datapaths

1. INTRODUCTION AND MOTIVATION

The demand for ultra-high-resolution video applications has resulted in a new standardization effort called High Efficiency Video Coding (HEVC) [1], developed by the ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC). HEVC provides approximately 50% bit-rate reduction compared to the state-of-the-art H.264/AVC High Profile [2] at similar subjective video quality. HEVC is based on the same hybrid motion compensation/transform coding structure as H.264/AVC.
Its coding efficiency is achieved through the use of larger block sizes (up to 64x64) for inter/intra prediction and a new quadtree structure that hierarchically partitions them down to blocks of 4x4 pixels. Other novel coding tools that improve the coding efficiency are: (1) 35 angular directions for intra prediction; (2) different transform sizes (from 4x4 to 32x32); (3) new motion compensation (MC) tools, e.g., motion vector (MV) competition, MV merging, and a 7-/8-tap interpolation filter for fractional-pel MC; (4) new tools for the in-loop deblocking filter, etc. With this large set of coding tools, HEVC achieves higher coding efficiency than H.264/AVC at the cost of a significant increase in computational complexity [3]. This is mainly due to the increased number of partitions and coding modes exercised by Rate-Distortion Optimized Mode Decision (RDO-MD), which controls the number of executions of Integer-/Fractional-pel Motion Estimation (IME/FME). There is also a slight increase in the computational complexity of the video decoder [3]. Real-time encoding and decoding for HEVC therefore requires hardware acceleration.

We profiled the HEVC reference encoder and decoder software (HM 9.0) [4] with GNU gprof in order to identify the computational hotspots¹ of the HEVC codec. Profiling provides insight into the relative distribution of execution time among the different coding tools of HEVC. The profiling was performed by encoding (and then decoding) 150 frames of the "People on Street" video sequence (2560x1600 pixels). The encoder was configured for Random Access (RA) with GOP = 32 and the four different Quantization Parameter (QP) values defined in the Common Test Conditions [5]. All other settings were kept at their defaults, except that Rate-Distortion Optimized Quantization (RDOQ) was disabled. RDOQ reduces the bit-rate by only 4% but significantly increases the encoding complexity.
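gprof reports time per function in its flat profile, whereas a coding-tool view needs the times grouped at a coarser level. The following is a minimal sketch of one way to do such grouping, assuming that demangled names of the form `Class::method(args)` are grouped by the text before the last `::`. The function `time_per_class`, the method names `costA`, `costB` and `filterA`, and all percentages in the sample are hypothetical and exist only for illustration; they are not measured results and not the authors' tooling.

```python
import re
from collections import defaultdict

# One row of a gprof flat profile: "% time", "cumulative seconds", "self seconds",
# then optionally "calls", "self s/call", "total s/call", then the function name.
ROW = re.compile(
    r"^\s*(\d+\.\d+)\s+\d+\.\d+\s+\d+\.\d+\s+"  # % time, cumulative s, self s
    r"(?:\d+\s+\d+\.\d+\s+\d+\.\d+\s+)?"         # optional call-count columns
    r"(\S.*)$"                                   # demangled function name
)

# Illustrative excerpt only: method names and numbers are made up for this sketch.
SAMPLE_FLAT_PROFILE = """\
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 40.00      4.00     4.00   100000     0.00     0.00  TComRdCost::costA(int)
 25.00      6.50     2.50    50000     0.00     0.00  TComInterpolationFilter::filterA(int)
 15.00      8.00     1.50    20000     0.00     0.00  TComRdCost::costB(int)
 20.00     10.00     2.00                             main
"""


def time_per_class(flat_profile):
    """Sum the '% time' column of a gprof flat profile per C++ class."""
    totals = defaultdict(float)
    for line in flat_profile.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header or explanatory line, not a data row
        percent = float(match.group(1))
        # Drop the argument list, then split "Class::method" at the last "::".
        qualified = match.group(2).strip().split("(", 1)[0]
        cls, sep, _method = qualified.rpartition("::")
        totals[cls if sep else "(no class)"] += percent
    return dict(totals)


if __name__ == "__main__":
    for cls, percent in sorted(time_per_class(SAMPLE_FLAT_PROFILE).items()):
        print(f"{cls}: {percent:.2f}%")
```

To obtain a real flat profile, the program is compiled and linked with `-pg`, run once to produce `gmon.out`, and then passed to `gprof` together with that file. The simple split at `::` does not handle templates or nested namespaces, so real profiles may need a more careful grouping rule.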
Due to space limitations, we show the profile of only one video sequence; other sequences give similar results. Fig. 1 shows the execution time distribution (in %) for each C++ class of the encoder and decoder. These classes were selected because together they account for more than 70% of the execution time. On the encoder side, 50%-70% of the encoding time is spent in RDO-MD for IME/FME and intra prediction (TComRdCost, TComInterpolationFilter). The interpolation filter for fractional-pel motion compensation (TComInterpolationFilter), the loop filter (TComLoopFilter, TComSampleAdaptiveOffset) and entropy decoding (TDecBinCABAC, TDecSbac, TDecEntropy) together account for 50%-60% of the decoding time.

Fig. 1 shows that the 7-/8-tap interpolation filter (TComInterpolationFilter class), which generates the half-/quarter-pel samples for fractional-pel ME/MC, consumes 20%-30% of the encoding time and 20%-40% of the decoding time because of its large number of multiply/add operations. Hence, it is a computational hotspot in both the HEVC encoder and decoder.

¹ Computational hotspots are the kernel functions in an application that consume most of the processing time.

978-1-4799-2341-0/13/$31.00 ©2013 IEEE, ICIP 2013

Therefore, high-throughput hardware