A Low-cost Approach Towards Streaming 3D Videos of Large-scale Sport
Events to Mixed Reality Headsets in Real-time
Kevin Marty* Prithvi Rajasekaran† Yongbin Sun‡ Klaus Fuchs§
Auto-ID Labs MIT & ETHZ
ABSTRACT
Watching sports events via 3D instead of two-dimensional video
streaming allows for increased immersion, e.g. via mixed reality
headsets in comparison to traditional screens. So far, capturing
3D video of sports events has required expensive outside-in tracking
with numerous cameras. This study demonstrates the feasibility of
streaming sports content to mixed reality headsets as holograms
in real-time using inside-out tracking and low-cost equipment only.
We demonstrate our system by streaming a race car on an indoor
track as a 3D model, which is then rendered in a Magic Leap One
headset. An onboard camera, mounted on the race car, provides the
video stream used to localize the car via computer vision. The local-
ization is estimated by an end-to-end convolutional neural network
(CNN). The study compares three state-of-the-art CNN models in
their respective accuracy and execution time, with PoseNet+LSTM
achieving position and orientation accuracies of 0.35 m and 3.95°.
The total streaming latency in this study was 1041 ms, suggesting
technical feasibility of streaming 3D sports content, e.g. on large
playgrounds, in near real-time onto mixed-reality headsets.
Index Terms: Augmented Reality—Visualization—Head mounted
display—Sport streaming; Deep learning—Image processing—
Pattern recognition—Localization
1 INTRODUCTION
Despite the increased immersion it allows, capturing and streaming
three-dimensional (3D) video to consumer devices has been under-
researched and has not yet been widely adopted by content creators
or developers. With the advent of augmented, virtual and mixed
reality devices (altogether referred to as XR), consuming 3D video
content has become more accessible than ever before. Compared
to watching two-dimensional (2D) video on traditional screens,
3D video streaming on XR devices allows for increased perceived
immersion in relation to the displayed content. In fact, 3D video
streaming allows for content to be perceived as more vivid, salient
and enjoyable, and interactions as more natural, compared to 2D
videos [2]. To be perceived as enjoyable, such XR applications
have to run at a minimum of 30 frames per second to ensure stable
and smooth movements of the displayed holograms, and at least
60 frames per second for a decent user experience [15]. A large
latency in the streaming pipeline would result in delayed updates
of the live event, causing users to miss fast-changing situations.
Therefore, it is key to capture the current position of the sport agent
frequently, requiring fast capturing and processing of video feeds to
produce a 3D video feed in near real-time, allowing viewers to
observe sports events as they unfold. Surprisingly, 3D streaming of
applications or videos in
*e-mail: martyk@ethz.ch
†e-mail: prithvir@mit.edu
‡e-mail: yb sun@mit.edu
§e-mail: fuchsk@ethz.ch
real-time has received little attention among researchers and
practitioners [14]. Despite recent developments and rapid advances
in hardware and software, a commercial breakthrough towards mass
adoption has not yet been observed, as most applications still remain
rather simple prototypes [11].
A significant barrier towards capturing 3D video content of sports
events is the technical requirements: current practice requires
outside-in tracking via numerous, expensive cameras, preventing
most sports events from streaming 3D content in real-time. Examples
of 3D video applications in the sports domain include the commercial
product FreeD's Replay [1]. Creating volumetric replay requires
high-end camera equipment; for example, streaming a tennis match
in 3D requires 28 cameras with 5K resolution installed around the
tennis court [22]. The equipment cost for 3D modeling with
multi-view cameras to stream 3D sport events scales with the size
of the playground. As an example, 38 ultra-high-definition cameras
are necessary to capture an entire soccer field with outside-in
tracking, as shown in Figure 1. Unfortunately, most current
approaches to modeling 3D sport events still rely on outside-in
capturing of 3D video provided by multiple static cameras around
the sports field.
Figure 1: Outside-in tracking of a soccer stadium with 38 ultra-high-definition cameras.
Figure 2: Inside-out tracking of a race track with one onboard camera mounted on the race car.
Computer vision can support generating 3D videos of sports
events at substantially lower costs by enabling systems to infer 3D
content from the current location of players (e.g. humans, race
cars) in 2D video feeds. Recently, a study demonstrated the
feasibility of converting 2D YouTube videos of historic soccer
matches into 3D videos [18], thereby not only enabling reviewing
old soccer matches in 3D, but also allowing 3D videos to be
generated at low cost. To infer the location of players on a field,
a convolutional neural network is trained on 3D data extracted
from soccer video games to estimate the depth map of each player
in every pose. After localizing the player on the field and
estimating the pose, the trained neural network calculates the
corresponding depth map. Their solution uses a field localization
approach, which only works on fields with dominant visual features,
such as a soccer field with a pre-defined layout (i.e. white lines,
green grass, four corners). Therefore, the approach of [18] is
limited to playgrounds covered by dense static mono cameras around
the field; it does not allow capturing sports content on larger
outdoor areas, e.g. race tracks, and requires the system to be
calibrated for one specific field outline.
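In contrast, inside-out tracking regresses the agent's 6-DoF pose directly from the onboard camera image with a CNN, as in the PoseNet family of models mentioned above. The following is a minimal sketch of such a pose regressor, not the network used in this study; it assumes PyTorch, and the class name `PoseRegressor` and all layer sizes are illustrative choices.

```python
import torch
import torch.nn as nn

class PoseRegressor(nn.Module):
    """PoseNet-style pose regressor (illustrative): a small CNN
    backbone followed by two heads that predict a 3-D position and
    a unit quaternion for orientation."""

    def __init__(self):
        super().__init__()
        # Tiny convolutional backbone; a real system would use a
        # pretrained backbone such as GoogLeNet or ResNet.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global average pooling
        )
        self.fc = nn.Linear(128, 256)
        self.pos_head = nn.Linear(256, 3)  # x, y, z in metres
        self.rot_head = nn.Linear(256, 4)  # quaternion (w, x, y, z)

    def forward(self, img):
        feat = self.backbone(img).flatten(1)
        feat = torch.relu(self.fc(feat))
        pos = self.pos_head(feat)
        quat = self.rot_head(feat)
        # Normalise so the output is a valid rotation quaternion.
        quat = quat / quat.norm(dim=1, keepdim=True)
        return pos, quat

# One onboard-camera frame (batch of 1, RGB, 224x224):
frame = torch.randn(1, 3, 224, 224)
pos, quat = PoseRegressor()(frame)
print(pos.shape, quat.shape)  # torch.Size([1, 3]) torch.Size([1, 4])
```

A network of this form maps each frame to a position and orientation in one forward pass, which is what makes per-frame localization fast enough for a live streaming pipeline; temporal smoothing (e.g. the LSTM variant named in the abstract) can then be layered on top of the per-frame estimates.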
2020 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW)
978-1-7281-6532-5/20/$31.00 ©2020 IEEE
DOI 10.1109/VRW50115.2020.0-223