978-3-9815370-0-0/DATE13/©2013 EDAA
Hardware-Software Collaborative Complexity Reduction Scheme for the Emerging HEVC Intra Encoder
Muhammad Usman Karim Khan, Muhammad Shafique, Mateus Grellert, Jörg Henkel
Chair for Embedded Systems (CES), Karlsruhe Institute of Technology (KIT), Germany
{muhammad.khan, muhammad.shafique, henkel}@kit.edu
Abstract: High Efficiency Video Coding (HEVC/H.265) is an emerging standard for video compression that provides almost double the compression efficiency of the current industry-standard Advanced Video Coding (AVC/H.264), at the cost of a major increase in computational complexity. This work proposes a collaborative hardware and software scheme for complexity reduction in an HEVC Intra encoding system, with run-time adaptivity. Our scheme leverages video content properties, which drive the complexity management layer (software) to generate a highly probable coding configuration. The intra prediction size and direction are estimated for the prediction unit, which reduces computational complexity. At the hardware layer, specialized coprocessors with enhanced reusability are employed as accelerators. Additionally, depending upon the video properties, the software layer administers the energy management of the hardware coprocessors. Experimental results show that a complexity reduction of up to 60% and an energy reduction of up to 42% are achieved.
I. INTRODUCTION AND MOTIVATION
Digital video compression is a fundamental requisite of many day-to-day applications, such as video conferencing, security and entertainment. Driven by ever-increasing video resolutions (from Full HD 1920×1080 to Quad Full HD 4096×2048 and Ultra HD 7680×4320) and frame rates (30 FPS to 60/120 FPS), the Joint Collaborative Team on Video Coding (JCT-VC) has recently developed the next-generation video coding standard, called High Efficiency Video Coding (HEVC, also termed H.265) [1].
The goal of HEVC is to increase the compression efficiency by 50% compared to H.264. This coding efficiency is achieved by introducing additional coding tools, and it is accompanied by a tremendous increase in computational complexity. Unlike H.264's concept of a Macroblock (MB, a 16×16 region of the video frame used as the primary compression unit), HEVC implements a quad-tree coding structure (see Fig. 1), called the Coding Tree Blocks (CTB). The concept of MBs is replaced by the Largest Coding Unit (LCU), which can be recursively divided into 4 Coding Units (CU) of size 2N×2N. The LCU is subdivided into every possible block partition size (CU size), and the best combination of CU sizes is selected by comparing the Rate-Distortion (RD) cost of one combination to that of the others (a process termed RD Optimization (RDO)). A CU can be further subdivided into Prediction Units (PU) (of size 2N×2N or N×N) and Transform Units (TU). Intra-video encoders exploit redundancies of a video sequence only in the spatial domain. These encoders are well suited to low-latency applications such as automotive systems, and to high-quality archiving solutions, where they remove motion artifacts. For HEVC Intra-encoding, a PU defines the basic entity for intra prediction and is restricted to the available angular directions and the DC and planar modes [1]. The PU partition for a CU and the best prediction mode are collectively called the coding configuration of the CU.
Fig. 1: One of the possible CU decompositions in HEVC, where a CU is recursively converted into sub-CUs and PUs. Block sizes shown: 64×64 (LCU), 32×32, 16×16, 8×8 and 4×4, with sub-CUs labeled CU 0 to CU 3. Final PU decomposition: 4×4 = N×N, others = 2N×2N. Constraints: 1) max CU size = LCU size; 2) min CU size = 8×8; 3) only an 8×8 CU can have four 4×4 PUs. Abbreviations: CTB = Coding Tree Blocks, LCU = Largest Coding Unit, CU = Coding Unit, PU = Prediction Unit.
Analysis and Problem: This enormous decision space for selecting the RD-optimal coding configuration is required for increased compression efficiency.
However, the iterative and recursive behavior of RDO incurs a significant complexity overhead, even for intra-only encoders, because the RDO decision has to recursively check each possible combination of PU and intra mode. It is noteworthy that the total number of mode combinations in HEVC is ~42.4× more than that in H.264. Our experiments in Fig. 2 show that the computational complexity of the complete Intra-only HEVC has increased by a factor of ~1.4× for a compression efficiency increase of around 35% as compared to Intra-only H.264. A similar analysis can be found in [6]. Note that, for a Full HD (1920×1080) video, it took approximately 83 seconds to encode one intra-frame on an Intel Core-2-Duo processor with 4 GB RAM, which illustrates a significant challenge for fast HEVC encoders. Therefore, it is vital to develop complexity reduction algorithms to realize real-world applications based on HEVC intra encoders.
The coding complexity indicates that hardware solutions are required in embedded video coding systems to fulfill the real-time encoding demands of HEVC. However, a hardware-only solution for HEVC will have a long time-to-market due to the time-consuming full-custom design cycle. Developing a software-only solution for HEVC encoding is fast and flexible, but its throughput is low.
Recently, a number of state-of-the-art HEVC intra encoders have been proposed, e.g. [7]. In [3], the authors proposed HEVC Intra prediction hardware for 4×4 blocks only. The work in [4] presents a gradient-based fast intra mode decision for a given PU size. In [5], the authors present a fast partition size selection algorithm for inter-frames that exploits temporal correlations for frame compression. These methods try to relieve pressure on the encoding modules by performing sub-optimal encoding and using hardware-only solutions, thus limiting the flexibility of the architectures and resulting in larger energy, area and memory overhead.
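The recursive RDO search described in the analysis above can be illustrated with a minimal Python sketch. It is not the authors' encoder or the HEVC reference software: the names `toy_rd_cost` and `rdo_quadtree`, the constant `lam`, and the cost model (squared deviation from the block mean plus a fixed rate term) are assumptions chosen for illustration, and the intra mode search and the 4×4 PU split are left out. The sketch only shows how each CU is costed as a whole and as four sub-CUs, with the cheaper option kept.

```python
from typing import List, Tuple

Frame = List[List[int]]          # luma samples, indexed frame[y][x]
Leaf = Tuple[int, int, int]      # (x, y, size) of a CU that is not split


def toy_rd_cost(frame: Frame, x: int, y: int, size: int, lam: float = 10.0) -> float:
    """Hypothetical RD cost: squared deviation from the block mean (a stand-in
    for prediction distortion) plus a constant rate term `lam` per block."""
    pixels = [frame[y + j][x + i] for j in range(size) for i in range(size)]
    mean = sum(pixels) / len(pixels)
    distortion = sum((p - mean) ** 2 for p in pixels)
    return distortion + lam


def rdo_quadtree(frame: Frame, x: int, y: int, size: int, min_size: int,
                 cost_fn=toy_rd_cost) -> Tuple[float, List[Leaf], int]:
    """Exhaustively choose between coding a CU whole or as four sub-CUs.

    Returns (best cost, chosen leaf CUs, number of CUs that were costed)."""
    whole_cost = cost_fn(frame, x, y, size)
    if size <= min_size:
        return whole_cost, [(x, y, size)], 1

    half = size // 2
    split_cost, split_leaves, evaluations = 0.0, [], 1
    for dy in (0, half):
        for dx in (0, half):
            cost, leaves, count = rdo_quadtree(
                frame, x + dx, y + dy, half, min_size, cost_fn)
            split_cost += cost
            split_leaves += leaves
            evaluations += count

    # Keep the split only if it is strictly cheaper than the unsplit CU.
    if split_cost < whole_cost:
        return split_cost, split_leaves, evaluations
    return whole_cost, [(x, y, size)], evaluations


if __name__ == "__main__":
    # 16x16 block that is flat except for a bright top-left 8x8 quadrant.
    frame = [[100 if (x < 8 and y < 8) else 0 for x in range(16)]
             for y in range(16)]
    cost, leaves, evaluations = rdo_quadtree(frame, 0, 0, 16, 4)
    print(cost, leaves, evaluations)
```

Even in this simplified form, a 64×64 LCU with an 8×8 minimum CU size requires costing 1 + 4 + 16 + 64 = 85 CUs before any intra modes are tried, which gives a sense of the search space that complexity reduction schemes aim to narrow.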
Our Novel Contributions: To satisfy the real-time throughput constraints of HEVC intra-encoding and to reduce energy