Joint feature-based visual quality assessment

D.-O. Kim, R.-H. Park and D.-G. Sim

A feature-based visual quality measurement method is presented as a reduced-reference approach. The proposed metric, called the joint feature similarity metric, is based on three types of point features extracted from the edge, planar and corner regions. It can characterise video quality in homogeneous and edge regions simultaneously. The effectiveness of the proposed visual quality assessment algorithm is shown with abundant sets of test video sequences.

Introduction: Several video quality assessment algorithms have been proposed, aiming to quantify the subjective quality of distorted videos [1, 2]. Depending on the availability of reference images, image quality assessment algorithms are classified into three categories: full-reference (FR), reduced-reference (RR) and no-reference (NR) algorithms. The structural similarity (SSIM) [1] and the edge peak signal-to-noise ratio (EPSNR) measure [2] were presented as FR algorithms. Both SSIM and EPSNR are related to the human visual system, noting that people evaluate image quality based on structural information rather than on pixel intensities themselves. In practical real-time video services, FR algorithms cannot be used to assess video quality at the decoder side because a reference video is not available to the client. RR algorithms, however, make use of feature values of reference videos rather than the whole reference videos themselves. Fig. 1 illustrates a common block diagram of conventional RR algorithms. In this Letter we propose a new RR algorithm employing joint visual features based on human visual perception.

Fig. 1 Block diagram of conventional RR algorithm

Joint feature similarity metric: For most real applications, reference videos are not available at the decoder side.
However, compact features of the reference video can be transmitted through an ancillary channel, which enables us to assess the quality of a distorted video at the receiver side. In this Letter, a joint feature similarity metric (JFSM) is proposed by introducing three types of point features based on the Harris corner detector [3]. The proposed point features are extracted in two steps. First, an image I is filtered by a highpass filter to obtain the directional derivative images I_x and I_y. Secondly, the eigenvalues λ_1 and λ_2 of a symmetric matrix C are computed. The matrix C is defined by

C = \begin{bmatrix} \sum_{W} (I_x(x_i, y_i))^2 & \sum_{W} I_x(x_i, y_i)\,I_y(x_i, y_i) \\ \sum_{W} I_x(x_i, y_i)\,I_y(x_i, y_i) & \sum_{W} (I_y(x_i, y_i))^2 \end{bmatrix}

where I_x and I_y denote the partial derivatives of I with respect to x and y, respectively, and (x_i, y_i) represents a pixel position in a window W. The geometrical structure of a point in an image can be described by the eigenvalues. That is, if both eigenvalues λ_1 and λ_2 are small (the maximum of the two values, λ_max, is smaller than a threshold th_p), the point is considered to be in a planar region. If both eigenvalues are large (the minimum of the two values, λ_min, is larger than a threshold th_c), the point is defined as a corner point. If one eigenvalue is large and the other is small (the ratio γ = λ_max/λ_min of the two eigenvalues is larger than a threshold th_e), the point is regarded as an edge point. In this Letter, three different types of point features are extracted to quantify the video quality because human beings are more interested in the regions near edges or corner points in a high-quality video than in the planar regions in a low-quality video. Thus, we propose a perceptual video quality measure that is extracted from edge and corner regions, having high-frequency characteristics, and planar regions, with low-frequency characteristics.
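The eigenvalue-based classification described above can be sketched as follows; the closed-form 2 × 2 eigenvalue computation is standard, but the function name and the threshold defaults are illustrative placeholders, not values from the Letter.

```python
import math

def point_type(sxx, sxy, syy, th_p=1e-2, th_c=1.0, th_e=10.0):
    """Classify an image point from the entries of the 2 x 2 matrix
    C = [[sxx, sxy], [sxy, syy]], i.e. the windowed sums of Ix^2,
    Ix*Iy and Iy^2.  Threshold defaults are illustrative only."""
    # Closed-form eigenvalues of a symmetric 2 x 2 matrix
    tr = sxx + syy
    det = sxx * syy - sxy * sxy
    disc = math.sqrt(max(tr * tr / 4.0 - det, 0.0))
    l_max = tr / 2.0 + disc
    l_min = tr / 2.0 - disc
    if l_max < th_p:                           # both eigenvalues small
        return "planar", l_max, l_min
    if l_min > th_c:                           # both eigenvalues large
        return "corner", l_max, l_min
    if l_min > 0.0 and l_max / l_min > th_e:   # one large, one small
        return "edge", l_max, l_min
    return "none", l_max, l_min                # no feature assigned
```

In practice the windowed sums would be computed around each candidate pixel after highpass filtering; only the classification step is shown here.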
That is, a point in an image is described by a feature value and a feature type, i.e. whether the point is in an edge, planar or corner region. The proposed quality measures are defined as

M_e = \frac{\gamma_r\,\gamma_d}{\gamma_r + \gamma_d}, \qquad M_p = \frac{\lambda_{\max,r}\,\lambda_{\max,d}}{\lambda_{\max,r} + \lambda_{\max,d}}, \qquad M_c = \frac{\lambda_{\min,r}\,\lambda_{\min,d}}{\lambda_{\min,r} + \lambda_{\min,d}}

where M_e, M_p and M_c represent the individual quality measures defined over edge, planar and corner regions, and the subscripts r and d denote the reference and distorted videos, respectively. The proposed quality metric, JFSM, is then evaluated by combining the three measures:

JFSM = a\,(MM_e + MM_p + MM_c) + b

where MM_e, MM_p and MM_c represent the mean values of the three quality measures computed over the extracted feature points of each type, and a and b denote constants.

Experimental results and discussion: The goal of quality measurement algorithms is to quantify the subjective quality of an image or video. Therefore, the objective quality measurements should be consistent with those of human beings. In this Letter, we show the effectiveness of the proposed algorithm by comparing its measurement values with differential mean opinion score (DMOS) values. If the quality measurement values are linearly proportional to the DMOS values, the quality measurements are considered to be accurate. For the ground-truth DMOS values, we generated a mean opinion score (MOS) dataset with 140 video clips that were compressed by standard video codecs (H.263 and H.264/AVC). Note that the length of each sequence is 30 s. MOS values were obtained from a subjective test based on the double-stimulus continuous quality-scale (DSCQS) method presented in ITU-R Recommendation BT.500-11 [4]. Note that 30 people participated in this experiment. Thirty (120) points of each feature type were used for QCIF (CIF) images. Our experiments were performed in two phases: training and test.
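The measures above share one per-point form, f_r·f_d/(f_r + f_d), combined into JFSM as a weighted sum of per-type means. A minimal sketch, assuming per-point feature values are already paired between reference and distorted videos (the defaults for a and b are placeholders; the Letter fits them by regression against DMOS):

```python
def feature_similarity(f_r, f_d):
    """Per-point similarity f_r*f_d/(f_r + f_d), the common form of
    M_e, M_p and M_c, where f is gamma, lambda_max or lambda_min of
    the reference (r) and distorted (d) videos."""
    s = f_r + f_d
    return f_r * f_d / s if s != 0.0 else 0.0

def jfsm(edge_pairs, planar_pairs, corner_pairs, a=1.0, b=0.0):
    """JFSM = a*(MM_e + MM_p + MM_c) + b, where each MM is the mean
    per-point measure over the feature points of one type."""
    def mean_measure(pairs):
        return sum(feature_similarity(r, d) for r, d in pairs) / len(pairs)
    return a * (mean_measure(edge_pairs)
                + mean_measure(planar_pairs)
                + mean_measure(corner_pairs)) + b
```

Note that when f_r = f_d the per-point similarity reaches f_r/2, so each measure grows with both the agreement and the strength of the underlying feature.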
First, 50 video sequences out of the 140 were randomly selected as a training set. Based on linear regression, the two coefficients (a and b) of the JFSM were computed from the measurements and DMOS values for the 50 sequences. Table 1 shows the performance of the JFSM, SSIM and EPSNR algorithms in terms of the Pearson correlation coefficient [5], the sum of absolute errors (SAE) and the required data size for a reference video. Pearson correlation and SAE values are computed by comparing each quality measurement with the DMOS values. As shown in Table 1, the Pearson correlation coefficient of the JFSM with the DMOS values is higher by 13.94% (39.71%) than that of SSIM (EPSNR). The SAE between the JFSM values and the DMOS values is also 6.74% (15.08%) smaller than that between the SSIM (EPSNR) values and the DMOS values. Furthermore, the JFSM yields much better accuracy in terms of the Pearson correlation as well as the SAE with a data size of only about 3% (90 × 8.25/(176 × 144)) of that of the conventional algorithms. In the case of a QCIF image, the data size to be transmitted to the client side is 90 pixels × 8.25 bytes/pixel (4 bytes for a pixel position (x, y), 2 bits for the feature type, and 4 bytes for a feature value (one of λ_max, λ_min and γ according to the feature type) in floating point). In the same manner, 360 pixels × 8.25 bytes/pixel are required for a CIF image. Note that the data representation can be further optimised with an entropy coding method. In contrast, in the cases of SSIM and EPSNR, the data size required to measure the visual quality is 25 344 (101 376) bytes, equal to the image size, for a QCIF (CIF) image.

Table 1: Comparison of quality metrics with DMOS in terms of Pearson correlation, SAE and required data size (training set)

                          SSIM    EPSNR   JFSM
Pearson correlation       0.667   0.544   0.760
SAE                       5.150   5.656   4.803
Remarks (data size (%))   100     100     2.93

Figs. 2a and b illustrate the quality measure values (DMOS, SSIM, EPSNR and JFSM) against bit rate for the Akiyo and Carphone sequences.

ELECTRONICS LETTERS 11th October 2007 Vol. 43 No. 21
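The side-channel payload arithmetic stated in the discussion (8.25 bytes per feature point, 90 points for QCIF) can be checked with a short sketch; the function name is ours, not from the Letter.

```python
def rr_payload_bytes(n_points, pos_bytes=4, type_bits=2, value_bytes=4):
    """Reduced-reference payload for n_points features: 4 bytes for
    the (x, y) position, 2 bits for the feature type and 4 bytes for
    the floating-point feature value = 8.25 bytes per point."""
    return n_points * (pos_bytes + type_bits / 8.0 + value_bytes)

qcif_payload = rr_payload_bytes(90)      # 3 feature types x 30 points
qcif_ratio = qcif_payload / (176 * 144)  # fraction of raw QCIF size
```

This reproduces the reported figures: 742.5 bytes for QCIF, about 2.93% of the 25 344-byte image that FR metrics such as SSIM and EPSNR require.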