Predicting the Perceptual Quality of Networked Video through Light-Weight Bitstream Analysis

Abdul Hameed, Rui Dai, Benjamin Balas
North Dakota State University
Email: {abdul.hameed, rui.dai, benjamin.balas}@ndsu.edu

Abstract—With the exponential growth of video traffic over wireless networked and embedded devices such as mobile phones and sensors, mechanisms are needed to predict the perceptual quality of video in real time and with low complexity, so that networking protocols can control video quality and optimize network resources to meet the quality of experience (QoE) requirements of users. This paper proposes an efficient, light-weight video quality prediction model based on partial parsing of compressed video bitstreams. A set of features is introduced to reflect video content characteristics and the distortions caused by compression and transmission. All of the features can be obtained directly from the H.264/AVC compressed bitstream in parsing mode, without decoding the pixel information in macroblocks. Based on these features, an artificial neural network model was trained for perceptual quality prediction. Evaluation results show that the proposed model achieves accurate prediction of perceptual video quality at low computational cost. It is therefore well suited for real-time networked video applications on embedded devices.

I. INTRODUCTION

In recent years we have witnessed exponential growth of various video applications over wireless networked and embedded devices such as mobile phones and sensors. Maintaining good visual quality for these applications is a focal concern of service providers and network designers in satisfying the quality of experience (QoE) requirements of end users. Moreover, for many applications, it is essential to guarantee good visual quality because users make critical decisions based on their visual observations, e.g., identifying intruders based on videos from a wireless surveillance network.
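The pipeline the abstract describes — parse lightweight features from the compressed stream, then map them to a MOS estimate with a trained neural network — can be sketched as follows. This is a minimal illustration, not the paper's actual model: the feature names (bit rate, frame rate, packet loss ratio) and the tiny one-hidden-layer network with placeholder weights are assumptions made for demonstration only.

```python
import math

# Hypothetical stream-level features, as might be parsed from an H.264/AVC
# bitstream without decoding macroblock pixels. Names and values are
# illustrative, not the paper's feature set.
features = {
    "bit_rate_kbps": 512.0,
    "frame_rate_fps": 25.0,
    "packet_loss_ratio": 0.02,
}

def normalize(f):
    # Simple scaling with assumed operating ranges for each feature.
    return [
        f["bit_rate_kbps"] / 2000.0,
        f["frame_rate_fps"] / 30.0,
        f["packet_loss_ratio"] / 0.1,
    ]

# Toy single-hidden-layer network; in the paper this would be trained on
# subjective MOS data, but the weights below are untrained placeholders.
W1 = [[0.8, -0.2, -1.5], [0.3, 0.6, -0.9]]
b1 = [0.1, -0.1]
W2 = [1.2, 0.7]
b2 = 0.2

def predict_mos(f):
    x = normalize(f)
    # Hidden layer with tanh activation.
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    # Linear output, squashed onto the 1..5 MOS scale.
    y = sum(w * hi for w, hi in zip(W2, h)) + b2
    return max(1.0, min(5.0, 1.0 + 4.0 / (1.0 + math.exp(-y))))

mos = predict_mos(features)
```

In a real deployment the network weights would come from training against MOS labels gathered in subjective tests, and the feature vector would be extracted per video segment during bitstream parsing.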
The majority of multimedia networking protocols aim to satisfy quality of service (QoS) requirements, which are usually specified in terms of bandwidth, delay, and packet loss ratio. However, these networking parameters alone do not necessarily reflect a user's experience of viewing the received videos. According to many subjective test results, the mean opinion scores (MOS) given by viewers on distorted videos cannot be determined merely by bit rate and packet loss ratio [1]. Given a video compression algorithm, the perceptual quality of videos compressed at the same bit rate can vary with video content characteristics, such as the level of spatial detail (brightness, edges, texture complexity, etc.) and temporal detail (e.g., the extent of motion). The distortion caused by transmission is related to the locations of lost packets in a bitstream, and the visibility of packet loss also depends significantly on the content of the video [2].

To achieve more effective control of QoE for various video applications, mechanisms are needed that can predict perceptual video quality accurately and in real time. More specifically, the MOS of networked videos, which is otherwise collected through time-consuming subjective tests, should be predicted as a function of observable parameters from the video stream or the network. In particular, many intermediate nodes in wireless networks, such as mobile phones and sensors, are embedded devices with limited processing power; therefore, quality prediction is expected to be conducted in a computationally efficient way. Networking protocols can then leverage the prediction to control video quality and optimize network resources to meet the QoE requirements of users.

Perceptual video quality can be measured using reference-based or reference-free methods.
Reference-based methods require access to the original source video (or quality features derived from it) to assess the quality of a compressed video [3], while reference-free methods assess video quality based on information from the compressed video alone, without referring to the source video [4]. Since reference-based methods are complicated to implement and cannot be used when source videos are absent, reference-free methods are preferred for real-time monitoring and control of video quality in a network.

Several reference-free quality prediction models have been introduced in the literature. The ITU-T recommendation G.1070 [5] provides a parametric model that estimates video quality from bit rate, frame rate, and the percentage of packet loss; one major drawback of this model is that video content is not taken into consideration. In [4] and [6], video quality estimation models were developed using regression techniques, and both models made use of detailed content information such as motion vectors (MVs). The models in [7] and [8] estimated quality based on content features such as blockiness, blurriness, and extent of motion, which have to be extracted from a reconstructed (fully decoded) video. Extracting such detailed content information, whether motion vectors or even reconstructed pixels, makes the estimation process complex but does not necessarily lead to better performance.

In addition, QoE prediction models have been proposed for different types of wireless networks. A video quality estimator for UMTS networks was proposed in [9]. This model clustered video sequences into several groups based on content type, and video quality was estimated by a nonlinear function of content type, sender bit rate, block error rate, and mean burst