IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY

Modeling of Rate and Perceptual Quality of Compressed Video as Functions of Frame Rate and Quantization Stepsize and Its Applications

Zhan Ma, Student Member, IEEE, Meng Xu, Yen-Fu Ou, and Yao Wang, Fellow, IEEE

Abstract—This paper first investigates the impact of frame rate and quantization on the bit rate and perceptual quality of compressed video. We propose a rate model and a quality model, both in terms of the quantization stepsize and frame rate. Both models are expressed as the product of separate functions of quantization stepsize and frame rate. The proposed models are analytically tractable, each requiring only a few content-dependent parameters. The rate model is validated over videos coded using both scalable and non-scalable encoders, under a variety of encoder settings. The quality model is validated only for scalable video, although it is expected to be applicable to single-layer video as well. We further investigate how to predict the model parameters using the content features extracted from original videos. Results show accurate bit rate and quality prediction (average Pearson correlation > 0.99) can be achieved with model parameters predicted using three features. Finally, we apply rate and quality models for rate-constrained scalable bitstream adaptation and frame rate adaptive rate control. Simulations show that our model-based solutions produce better video quality compared with conventional video adaptation and rate control.

Index Terms—Rate model, perceptual quality model, content feature, rate control, scalable video adaptation, H.264/AVC, SVC

Copyright (c) 2011 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.

Z. Ma was with the Polytechnic Institute of New York University, Brooklyn, NY 11201, USA. He is now with the Dallas Technology Lab, Samsung Telecommunications America, Richardson, TX 75082, USA. (email: zhan.ma@ieee.org)

M. Xu, Y.-F. Ou and Y. Wang are with the Polytechnic Institute of New York University, Brooklyn, NY 11201, USA. (email: {mxu02, you01}@students.poly.edu, yao@poly.edu)

I. INTRODUCTION

A fundamental and challenging problem in video encoding is, given a target bit rate, how to determine at which spatial resolution (i.e., frame size), temporal resolution (i.e., frame rate), and amplitude (i.e., SNR) resolution (usually controlled by the quantization stepsize (QS) or consequently quantization parameter (QP)), to code the video. One may code the video at a high frame rate, large frame size, but high QS, yielding noticeable coding artifacts in each coded frame. Or one may use a low frame rate, small frame size, but small QS, producing high quality frames. These and other combinations can lead to very different perceptual quality. In traditional encoder rate-control algorithms, the spatial and temporal resolutions are pre-fixed based on some empirical rules, and the encoder varies the QS to reach a target bit rate. Selection of QS is typically based on models of rate versus QS. When varying the QS alone cannot meet the target bit rate, frames are skipped as necessary. Joint decision of QS and frame skip has also been considered, but often governed by heuristic rules, or using the mean square error (MSE) [1] as a quality measure. Ideally, the encoder should choose the spatial, temporal, and amplitude resolution (STAR) that leads to the best perceptual quality, while meeting the target bit rate.
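The selection problem stated above can be illustrated with a minimal sketch. It assumes that a rate value and a perceptual quality score are already available for every candidate combination of frame rate and QS, and it holds the frame size fixed for brevity. The names `OperatingPoint`, `best_point`, and `CANDIDATES`, as well as all numeric values, are hypothetical placeholders introduced only for illustration; they are not measurements or model outputs from this paper.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class OperatingPoint:
    """One candidate (frame rate, QS) combination at a fixed frame size."""
    frame_rate: float  # temporal resolution in frames per second
    qs: float          # quantization stepsize
    rate_kbps: float   # bit rate at this point (hypothetical value)
    quality: float     # perceptual quality score, higher is better (hypothetical)


def best_point(points: Iterable[OperatingPoint],
               target_kbps: float) -> Optional[OperatingPoint]:
    """Return the highest-quality point whose rate does not exceed the target.

    Ties in quality are broken in favor of the lower bit rate.
    Returns None when no candidate satisfies the rate constraint.
    """
    feasible = [p for p in points if p.rate_kbps <= target_kbps]
    if not feasible:
        return None
    return max(feasible, key=lambda p: (p.quality, -p.rate_kbps))


# Made-up candidates, for illustration only.
CANDIDATES = [
    OperatingPoint(frame_rate=30.0, qs=40.0, rate_kbps=480.0, quality=0.55),
    OperatingPoint(frame_rate=30.0, qs=16.0, rate_kbps=1200.0, quality=0.90),
    OperatingPoint(frame_rate=15.0, qs=16.0, rate_kbps=800.0, quality=0.78),
    OperatingPoint(frame_rate=15.0, qs=40.0, rate_kbps=320.0, quality=0.50),
    OperatingPoint(frame_rate=7.5, qs=16.0, rate_kbps=500.0, quality=0.60),
]

if __name__ == "__main__":
    print(best_point(CANDIDATES, target_kbps=500.0))
```

An exhaustive search of this kind is only practical when the rate and quality at each candidate point are known or can be predicted cheaply, which is why models of both quantities are needed.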
An optimal rate control solution requires accurate rate and perceptual quality prediction at any STAR combination.

In video streaming, the same video is often requested by receivers with diverse sustainable receiving rates. To address this diversity, a video may be coded into a scalable stream with many STAR combinations. Given a particular user's sustainable rate, either the server or a proxy needs to extract from the original bitstream certain layers corresponding to a particular STAR combination to meet the rate constraint. This problem is generally known as bitstream adaptation. Different combinations are likely to yield different perceptual quality. Here again, the challenging problem is to determine which STAR combination to extract to maximize the perceptual quality.

The latest scalable video coding (SVC) standard [2] enables lightweight bitstream manipulation [3] and also provides state-of-the-art coding performance [4], through its network-friendly interface design and efficient compression schemes inherited from H.264/AVC [5]. However, before SVC video can be widely deployed in practical applications, efficient mechanisms for SVC stream adaptation to meet different user constraints need to be developed. Optimal adaptation requires accurate prediction of the perceived quality as well as the total rate at any STAR combination.

Although much work has been done on perceptual quality modeling and on rate modeling for video at a fixed spatial and temporal resolution, the impact of spatial and temporal resolutions on the perceptual quality and rate has not been studied extensively. Recently, several studies have examined the influence of spatial, temporal, and amplitude resolutions, individually or jointly, on the perceptual quality [6]–[9]. However, some of these models require many parameters or have limited accuracy.
To the best of our knowledge, no prior work has attempted to predict the rates corresponding to different STAR combinations, and none has deployed rate and perceptual quality models to choose the best STAR combination for either video adaptation or encoder rate control. For encoder rate control, [1] has attempted to jointly consider quantization (for spatial quality) and frame rate (for