732 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 20, NO. 5, MAY 2010

Fast Context-Adaptive Mode Decision Algorithm for Scalable Video Coding With Combined Coarse-Grain Quality Scalability (CGS) and Temporal Scalability

Hung-Chih Lin, Wen-Hsiao Peng, and Hsueh-Ming Hang, Fellow, IEEE

Abstract—To speed up the H.264/MPEG scalable video coding (SVC) encoder, we propose a layer-adaptive intra/inter mode decision algorithm and a motion search scheme for the hierarchical B-frames in SVC with combined coarse-grain quality scalability (CGS) and temporal scalability. To reduce computation while maintaining the same level of coding efficiency, we examine the rate-distortion (R-D) performance contributed by different coding modes at the enhancement layers (EL) and the mode conditional probabilities at different temporal layers. For intra prediction on inter frames, we reduce the number of Intra4×4/Intra8×8 prediction modes by 50% or more, based on the reference/base layer intra prediction directions. For EL inter prediction, look-up tables containing inter prediction candidate modes are designed to exploit the macroblock (MB) coding mode dependence and the reference/base layer quantization parameters (Qp). In addition, to avoid checking all motion estimation (ME) reference frames, the base layer (BL) reference frame index is selectively reused, and, according to the EL MB partition, the BL motion vector can be used as the initial search point for the EL ME. Compared with Joint Scalable Video Model 9.11, our proposed algorithm provides a 20× speedup in encoding the EL and an 85% time saving on the entire encoding process with negligible loss in coding efficiency. Moreover, compared with other fast mode decision algorithms, our scheme achieves a 7–41% complexity reduction on the overall encoding process.

Index Terms—Coarse-grain quality scalability, encoder optimization, fast mode decision, scalable video coding (SVC).

Manuscript received April 8, 2009; revised September 16, 2009. First version published January 29, 2010; current version published May 5, 2010. This work was supported in part by the National Science Council, Taiwan, under Grants NSC 96-2221-E-009-063, NSC 95-2221-E-009-146, and NSC 95-2221-E-009-071. This paper was recommended by Associate Editor V. Bottreau.

H.-C. Lin and H.-M. Hang are with the Department of Electronics Engineering, National Chiao-Tung University (NCTU), Hsinchu 30010, Taiwan (e-mail: huchlin@gmail.com; hmhang@mail.nctu.edu.tw).

W.-H. Peng is with the Department of Computer Science, National Chiao-Tung University, Hsinchu 30010, Taiwan (e-mail: pawn@mail.si2lab.org).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSVT.2010.2045832

I. Introduction

IN RESPONSE to the increasing demand for scalability features in many applications, the Joint Video Team has recently standardized, based upon H.264/advanced video coding (AVC) [1], a scalable video coding standard (hereafter referred to as SVC) [2], [3] that furnishes spatial, temporal, signal-to-noise ratio (SNR), and combined scalabilities within a fully scalable bit stream. By employing multilayer coding along with hierarchical temporal prediction [4], [5], the SVC encodes a video sequence into an inter-dependent set of scalable layers, allowing a variety of viewing devices to perform discretionary layer extraction and partial decoding according to their playback capability, processing power, and/or network quality. As a scalable extension to H.264/AVC, the SVC inherits all the coding tools of H.264/AVC and additionally incorporates an adaptive inter-layer prediction mechanism to reduce the coding efficiency loss relative to single-layer coding. Superior coding efficiency is achieved with little increase in decoding complexity by means of the so-called single-loop decoding.
These key features distinguish the SVC from the scalable systems in prior video coding standards.

Although the decoding complexity was well studied and amended during the design phase of the SVC, its encoding complexity has rarely been addressed. An SVC encoder, whose operations are non-normative, can be quite flexible in its implementation as long as its bit streams conform to the specification. The current Joint Scalable Video Model (JSVM) v.9 [6] uses a bottom-up encoding process that adopts an exhaustive mode search for coder parameter selection. The exhaustive search strategy, though providing good rate-distortion (R-D) performance, spends a large amount of computation on evaluating all possible coding options, and most of these options turn out to offer little benefit to coding efficiency. For example, in a typical encoding experiment with combined temporal and coarse-grain quality scalability (CGS), it takes about 10–40 min of central processing unit (CPU) time (see the test conditions in Section V), depending on the number of enhancement layers (EL), to encode a two-second common intermediate format (CIF) video clip. A further study reveals that a large percentage of the computation comes from encoding the EL; more specifically, a CGS EL requires approximately three times the computation of its base layer (BL) due to the extra motion search for inter-layer motion estimation and residual prediction. A fast encoding algorithm is thus desirable for reducing the EL computational complexity without sacrificing R-D performance.

An effective way to reduce the encoding complexity is to restrict the number of candidate modes. There exists a large