HIGH-THROUGHPUT INTERPOLATION HARDWARE ARCHITECTURE
WITH COARSE-GRAINED RECONFIGURABLE DATAPATHS FOR HEVC
Cláudio Machado Diniz¹,², Muhammad Shafique¹, Sergio Bampi², Jörg Henkel¹
¹ Chair for Embedded Systems (CES), Karlsruhe Institute of Technology (KIT), Germany
² Institute of Informatics, PPGC, Federal University of Rio Grande do Sul (UFRGS), Brazil
{cmdiniz, bampi}@inf.ufrgs.br; {muhammad.shafique, henkel}@kit.edu, diniz@ira.uka.de
ABSTRACT
Fractional-pel interpolation for motion estimation and
motion compensation is one of the key computational
hotspots in the new High Efficiency Video Coding (HEVC)
standard. This work presents a high-throughput interpolation
hardware architecture to improve performance of HEVC
encoding and decoding. It employs two acceleration engines
for luma and chroma filtering, each with 12-pel-parallel
coarse-grained reconfigurable interpolation datapaths. An
adaptive scheduling scheme manages the operation of these
interpolation datapaths in different ways depending upon the
prediction unit (PU) size and the execution scenario (i.e.
motion estimation or motion compensation). We have
implemented our hardware architecture in 150 nm
technology. Compared to state-of-the-art techniques [12],
our architecture requires 49% less hardware area, while
processing QFHD (3840x2160) resolution @ 30 fps.
Index Terms—HEVC, Interpolation Filter, Motion
Estimation (ME), Motion Compensation (MC), Hardware
Acceleration, Reconfigurable Datapaths
1. INTRODUCTION AND MOTIVATION
The demand for ultra-high resolution video applications
has led to a new standardization effort, High Efficiency
Video Coding (HEVC) [1], developed by the ITU-T/ISO/IEC
Joint Collaborative Team on Video Coding (JCT-VC). HEVC
provides approximately 50% bit-rate reduction compared to the
state-of-the-art H.264/AVC High Profile [2] while providing
similar subjective video quality. HEVC is based on the same
hybrid motion compensation/transform coding structure as
H.264/AVC. Its coding efficiency is achieved through the use
of larger block sizes (up to 64x64) for inter/intra prediction
and a new quadtree structure to partition them down to blocks
of 4x4 pixels in a hierarchical way. Other novel coding tools
that improve the coding efficiency are: (1) 35 angular
directions for intra prediction; (2) different transform sizes
(from 4x4 to 32x32); (3) new motion compensation (MC)
tools, e.g. motion vector (MV) competition, MV merging, and a
7-/8-tap interpolation filter for fractional-pel MC; (4) new tools
for the in-loop deblocking filter, etc.
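To make the 7-/8-tap luma interpolation filter mentioned above concrete, the following is a minimal one-dimensional sketch in Python. The coefficients are those of the HEVC luma filter; the helper name `interpolate_row`, the edge-replication padding, and the single-pass rounding are illustrative assumptions, not the datapath proposed in this paper. A full codec filters horizontally and then vertically with higher intermediate precision.

```python
# HEVC luma interpolation coefficients, indexed by the fractional
# position in quarter-pel units. Every row sums to 64.
LUMA_TAPS = {
    1: (-1, 4, -10, 58, 17, -5, 1, 0),    # quarter-pel (7 non-zero taps)
    2: (-1, 4, -11, 40, 40, -11, 4, -1),  # half-pel (8 taps)
    3: (0, 1, -5, 17, 58, -10, 4, -1),    # three-quarter-pel (7 non-zero taps)
}


def interpolate_row(row, frac, bit_depth=8):
    """Interpolate one row of integer samples at horizontal offset frac/4.

    Hypothetical helper for illustration only: borders are padded by
    edge replication, and the result is rounded and clipped in one step.
    """
    taps = LUMA_TAPS[frac]
    max_val = (1 << bit_depth) - 1
    # The filter window spans samples x-3 .. x+4 around position x.
    padded = [row[0]] * 3 + list(row) + [row[-1]] * 4
    out = []
    for x in range(len(row)):
        acc = sum(c * padded[x + k] for k, c in enumerate(taps))
        val = (acc + 32) >> 6  # divide by 64 with rounding
        out.append(min(max(val, 0), max_val))
    return out


if __name__ == "__main__":
    edge = [0, 0, 0, 0, 255, 255, 255, 255]
    print(interpolate_row(edge, 2))  # half-pel samples across a step edge
```

Each output sample costs eight multiplications and seven additions per filtering pass in this direct form, which gives a sense of why the filter weighs heavily in the execution time once it is applied to every fractional position tested during motion estimation.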
Through the use of this large set of coding tools, HEVC
achieves higher coding efficiency than H.264/AVC at the
cost of a significant increase in the computational
complexity [3]. This is mainly due to the increased number
of partitions and coding modes exercised by Rate-Distortion
Optimized Mode Decision (RDO-MD), which controls the
number of executions of Integer-/Fractional-pel Motion
Estimation (IME/FME). There is also a slight increase in
the computational complexity of the video decoder [3].
Real-time encoding and decoding for HEVC requires
hardware acceleration. We perform profiling (using GNU
gprof) of the HEVC reference encoder and decoder software
(HM 9.0) [4] in order to identify the computational hotspots¹
of the HEVC codec. Profiling provides insight into the relative
distribution of execution time of different coding tools of
HEVC. The profiling was performed by encoding (and then
decoding) 150 frames of “People on Street” video sequence
(2560x1600 pixels). Encoding was configured to Random
Access (RA) with GOP = 32 and four different Quantization
Parameter (QP) values defined in the Common Test
Conditions [5]. All other configurations are kept at their defaults
except for the Rate-Distortion Optimized Quantization (RDOQ)
feature, which was disabled: RDOQ reduces the bit-rate
by only 4% but significantly increases the encoding
complexity. Due to space limitations, we show the profiling of
only one video, but other videos yield similar results.
Fig. 1 shows the execution time distribution (in %) for
each C++ class of the encoder and decoder. This group of
classes was selected because it represents more than 70%
of the execution time. On the encoder side, 50%-70% of
encoding time is spent in RDO-MD for IME/FME and intra
prediction (TComRdCost, TComInterpolationFilter). The
interpolation filter for fractional-pel motion
compensation (TComInterpolationFilter), the loop filter
(TComLoopFilter, TComSampleAdaptiveOffset) and entropy
decoding (TDecBinCABAC, TDecSbac, TDecEntropy)
together account for 50%-60% of decoding time.
Fig. 1 illustrates that the 7-/8-tap interpolation filter
(TComInterpolationFilter class), which generates half-/quarter-pel
samples for fractional-pel ME/MC, consumes 20%-30% of
encoding time and 20%-40% of decoding time due to its
large number of multiplication/addition operations.
Hence, this is a computational hotspot in both HEVC
encoder and decoder. Therefore, high-throughput hardware
¹ Computational hotspots are the kernel functions in an application
that consume most of the processing time.
2091 978-1-4799-2341-0/13/$31.00 ©2013 IEEE ICIP 2013