Understanding of Human Perceptual Quality in Tele-immersive Shared Activity

Zixia Huang, Ahsan Arefin, Pooja Agarwal, Klara Nahrstedt, Wanmin Wu
Department of Computer Science, University of Illinois at Urbana-Champaign
{zhuang21, marefin2, pagarwal, klara, wwu23}@illinois.edu

Technical Report, Department of Computer Science, University of Illinois at Urbana-Champaign. Submitted on Dec 12, 2011.

ABSTRACT

Both the comparative category rating (CCR) and degradation category rating (DCR) methods [20] have been heavily employed in subjective evaluations of media systems. The resulting metrics, the comparative mean-opinion-score (CMOS) and the degradation mean-opinion-score (DMOS), can be used to describe the subjective quality of a system. However, these subjective metrics may fail when the variance of participant votes is large. Such diversity in human preferences can arise from tradeoffs among multiple quality dimensions that concurrently dominate the overall quality of the media system. In this paper, we conduct a user study with 19 participants to evaluate the subjective quality of two tele-immersive shared activities (TISA), where media samples of different qualities are evaluated for each activity. Our study aims to (1) show the effectiveness and limitations of CMOS and DMOS using real subjective data, and (2) demonstrate the heterogeneous impacts of TISAs on human perceptions.

Categories and Subject Descriptors

H.1.2 [Information Systems]: Models and Principles: Human factors; H.4.3 [Information Systems Applications]: Communications Applications: Computer conferencing, teleconferencing, and videoconferencing

General Terms

Experimentation, Measurement

Keywords

3D Tele-immersion, Subjective Quality Assessment

1. INTRODUCTION

Researchers usually propose objective metrics to describe the quality of service (QoS) of media applications in various aspects. However, these QoS metrics alone are unable to characterize human perception, and it can be difficult to formulate their combined effects in a closed form. Hence, subjective evaluations are needed to assess the real quality of experience (QoE) of media applications and to guide system adaptations.

Many subjective studies [6, 10, 41, 43, 44, 46] have employed the absolute category rating (ACR) method specified in ITU-R BT.500 [15], in which participants observe a single media sample and give an ACR score from 1 to 5 (a higher score indicates better quality). The average of the participant scores is computed as the mean-opinion-score (MOS). The problem with ACR, however, is that a standard rating scale is missing due to the absence of a reference sample (i.e., a prescribed sample with the best possible quality). As a result, participants usually give scores based on their own expertise, which leads to non-uniform distributions of rating scores and can invalidate the subjective results.

To address this ACR drawback, ITU-T P.910 [20] proposes an alternative assessment method in which participants observe two media samples and give a comparative rating score. This can be either degradation category rating (DCR), in which a degraded sample is compared against a reference sample, or comparative category rating (CCR), in which any two media samples with different qualities are compared. Participants give voting scores in the comparison process (details in Section 3), and the resulting average score is either the degradation mean-opinion-score (DMOS) or the comparative mean-opinion-score (CMOS).
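To make these scoring conventions concrete, the following minimal sketch shows how MOS, DMOS, and CMOS could be computed from raw votes. It is an illustration only, not the evaluation code used in this study: the 5-point ACR/DCR scales and the 7-point CCR scale (-3 to +3) are assumed from common ITU-T P.910 and ITU-R BT.500 practice, and the function names and sample votes are hypothetical.

```python
from statistics import mean, stdev

def mos(acr_votes):
    """Mean-opinion-score: average of 1-5 ACR votes for a single sample."""
    return mean(acr_votes)

def dmos(dcr_votes):
    """Degradation MOS: average of 1-5 DCR impairment votes
    (5 = degradation imperceptible, 1 = very annoying)."""
    return mean(dcr_votes)

def cmos(ccr_votes):
    """Comparative MOS: average of -3..+3 CCR votes
    (positive = second sample preferred over the first)."""
    return mean(ccr_votes)

# Hypothetical CCR votes comparing two samples that trade off video frame
# rate against one-way delay.
split_votes = [+2, +2, +2, -2, -2, -2]  # two opposing preference groups
even_votes  = [ 0,  0, +1, -1,  0,  0]  # participants genuinely indifferent

print(cmos(split_votes), stdev(split_votes))  # 0.0, spread ~2.19
print(cmos(even_votes),  stdev(even_votes))   # 0.0, spread ~0.63
# Both vote sets yield CMOS = 0; only the dispersion of the votes reveals
# that the first comparison hides two opposing preference groups.
```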
In this sense, DCR can be viewed as a special case of CCR, and CMOS can be used to approximate DMOS (Section 3.2). Several studies [13, 16, 17, 34] have utilized DCR and CCR in their subjective evaluations.

While CCR (and DCR) generally performs far better than ACR in terms of rating-scale uniformity, we argue that the resulting subjective metric, CMOS, is unable to capture the variance of user votes. The problem is that the quality of a media system can be concurrently dominated by multiple quality dimensions (e.g., video frame rate, one-way delay). Hence, the tradeoffs among these dimensions in a comparison test can trigger diverse human preferences (Section 3.2), as demonstrated in our past VoIP studies [13, 34]. For instance, if participants who favor a higher frame rate vote in the opposite direction from those who favor a lower one-way delay, their votes can cancel out and yield a CMOS close to zero, even though few participants actually perceive the two samples as equivalent. Note that this problem can only arise in CCR, where neither media sample in the comparison is the reference and multi-dimensional quality tradeoffs exist. To the best of our knowledge, however, no other study has investigated this CCR issue in interactive video systems.

Contributions. The problem of CMOS in capturing the user interest diversity has motivated us to evaluate the hu-