Toward Futuristic Near-Natural Collaborations on Distributed Multimedia Plays Architecture

Leif Arne Rønningen, Mauritz Panggabean, and Özgür Tamer
Department of Telematics
Norwegian University of Science and Technology (NTNU), N-7491, Trondheim, Norway
{leifarne, panggabean, ozgurt}@item.ntnu.no

Abstract—This paper presents a vision of how people in different places on the continents can collaborate in real time in the future with near-natural quality of experience. Achieving such very high quality imposes equally high and challenging requirements. The ultimate challenge for implementation is to guarantee an end-to-end delay of less than 10-20 ms, with graceful quality variability, to enable live musical collaboration. Since existing Internet standards cannot provide such a guarantee, we propose the Distributed Multimedia Plays architecture, with the AppTraNet protocol and the design of collaboration spaces, for realizing this vision. The relation of this novel proposal to existing and future standards is also discussed. Significant milestones on this research avenue are expected to be attainable in the next 5-10 years.

Index Terms—Distributed Multimedia Plays Architecture, AppTraNet Protocol, Collaboration Spaces, Quality Shaping, Near-natural Quality of Experience.

I. INTRODUCTION

Rapid advancements in electronics as well as information and communication technology (ICT) have unveiled the possibility of new and more creative ways of real-time multi-party collaboration that is no longer limited by time and space. We envision such future collaborations with near-natural quality of experience through networked collaboration spaces that seamlessly combine virtual (taped) or live scenes from distributed sites on the continents, sites that may differ from each other in technical specifications.
For example, consider an audience in London attending a concert by three opera singers in a specially designed room, namely a collaboration space. The multimedia quality they experience is so close to natural that they do not realize that two of the singers are singing live from two different cities, say Oslo and Amsterdam, each in their own collaboration space. The performance of the third singer is played from a remote server, and yet the three perform together so harmoniously, with life-like multimedia quality, that the audience believe they are enjoying a live opera concert in the very same room as the three singers. Moreover, each opera singer performing live also experiences singing together with the other two, as displayed in his or her own collaboration space, as if they were on the same stage.

The main technical requirements on important aspects of the envisioned collaborations are listed in Table I. The ultimate challenge is ensuring that the maximum end-to-end delay is around 10-20 ms to enable good synchronization in musical collaboration.

TABLE I
THE MAIN TECHNICAL REQUIREMENTS FOR THE ENVISIONED COLLABORATIONS, DERIVED FROM THE AIMED QUALITY OF EXPERIENCE

Nr.  Main technical requirements
1.   Guaranteed maximum end-to-end delay ≤ 10-20 ms
2.   Near-natural video quality
3.   Autostereoscopic multi-view 3D vision
4.   High spatial and temporal resolution due to the life-size dimension of objects, i.e. mostly humans
5.   Accurate representation of physical presence cues, e.g. eye contact and gestures
6.   Real 3D sound
7.   Quality allowed to vary with time due to different technical specifications among collaboration spaces
8.   Quality-variation guarantee
9.   Graceful quality degradation under traffic overload or failure
10.  Privacy provided by a defined security level
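The end-to-end delay budget maps directly onto geography. A minimal back-of-envelope sketch in Python, assuming signal propagation at roughly 70% of the speed of light (a common rule of thumb for optical fiber) and ignoring routing, queuing, and codec delays:

```python
C_VACUUM_M_S = 299_792_458   # speed of light in vacuum (m/s)
PROP_FRACTION = 0.7          # assumed propagation speed, ~70% of c (optical fiber)

def max_radius_km(one_way_delay_ms: float) -> float:
    """Largest physical separation (km) reachable within a one-way
    propagation-delay budget, ignoring routing and processing delays."""
    distance_m = C_VACUUM_M_S * PROP_FRACTION * (one_way_delay_ms / 1000.0)
    return distance_m / 1000.0

if __name__ == "__main__":
    for delay_ms in (10.0, 11.5, 20.0):
        print(f"{delay_ms:5.1f} ms -> about {max_radius_km(delay_ms):6.0f} km")
```

Under these assumptions, the 10-20 ms requirement corresponds to separations of roughly 2,100-4,200 km; in practice, routing, queuing, and processing delays reduce the usable radius considerably.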
In an experiment on the effect of time delay on ensemble accuracy, pairs of musicians were placed apart in isolated rooms and asked to clap a rhythm together; longer delays produced increasingly severe tempo deceleration, while shorter delays produced a modest but surprising acceleration [1]. The results indicate an observed optimal delay for synchronization of 11.5 ms, which equates to a physical radius of 2,400 km (assuming signals traveling at approximately 70% of the speed of light and no routing delays). Realizing such collaborations with very high quality and complexity will be possible if the rest of the requirements can be fulfilled within this maximum time delay.

To provide a glimpse of the complexity, let us zoom in on the requirements related to vision alone. The human sense of depth in vision can nowadays be emulated by commercially available autostereoscopic multi-view 3D displays. As the life-size dimension of objects in the scenes is a must to promote near-natural collaboration, one logical solution is to construct all surfaces of a collaboration space from arrays of such displays. Such real dimensions imply the use of high-definition spatial resolution, which may result in extremely high data rates, even from a single collaboration space. Providing an accurate representation of physical presence cues, e.g. eye contact and gestures, in the processed and transmitted videos between the collaborating parties adds more computational