A Low-cost Approach Towards Streaming 3D Videos of Large-scale Sport
Events to Mixed Reality Headsets in Real-time
Kevin Marty* Prithvi Rajasekaran† Yongbin Sun‡ Klaus Fuchs§
Auto-ID Labs MIT & ETHZ
ABSTRACT
Watching sports events via 3D instead of two-dimensional video
streaming allows for increased immersion, e.g. via mixed reality
headsets in comparison to traditional screens. So far, capturing
3D video of sports events has required expensive outside-in tracking
with numerous cameras. This study demonstrates the feasibility of
streaming sports content to mixed reality headsets as holograms
in real-time using inside-out tracking and low-cost equipment only.
We demonstrate our system by streaming a race car on an indoor
track as a 3D model, which is then rendered in a Magic Leap One
headset. An onboard camera, mounted on the race car, provides the
video stream used to localize the car via computer vision. The local-
ization is estimated by an end-to-end convolutional neural network
(CNN). The study compares three state-of-the-art CNN models in
their respective accuracy and execution time, with PoseNet+LSTM
achieving position and orientation accuracies of 0.35 m and 3.95°.
The total streaming latency in this study was 1041 ms, suggesting
technical feasibility of streaming 3D sports content, e.g. on large
playgrounds, in near real-time onto mixed-reality headsets.
Index Terms: Augmented Reality—Visualization—Head mounted
display—Sport streaming; Deep learning—Image processing—
Pattern recognition—Localization
1 INTRODUCTION
Despite the increased immersion it allows, capturing and streaming
three-dimensional (3D) video to consumer devices has been under-
researched and has not yet been widely adopted by content creators
or developers. With the advent of augmented, virtual and mixed
reality devices (altogether referred to as XR), consuming 3D video
content has become more accessible than ever before. Compared
to watching two-dimensional (2D) video on traditional screens,
3D video streaming on XR devices allows for increased perceived
immersion in relation to the displayed content. In fact, 3D video
streaming allows for content to be perceived as more vivid, salient
and enjoyable, and interactions as more natural, compared to 2D
videos [2]. To be perceived as enjoyable, such XR applications
have to run at a minimum of 30 frames per second to ensure stable
and smooth movements of the displayed holograms, and at least
60 frames per second for a decent user experience [15]. A large
latency in the streaming pipeline would result in delayed updates
of the live event, causing users to miss fast-changing situations.
Therefore, it is key to capture the current position of the sport agent
frequently, requiring fast capturing and processing of video feeds to
produce a 3D video feed in near real-time, allowing viewers to
observe sports events as they unfold. Surprisingly, 3D streaming of
applications or videos in
*e-mail: martyk@ethz.ch
†e-mail: prithvir@mit.edu
‡e-mail: yb sun@mit.edu
§e-mail: fuchsk@ethz.ch
real-time has received little attention among researchers and
practitioners [14]. Despite recent developments and rapid advances
in hardware and software, a commercial breakthrough towards mass
adoption has not yet been observed, as most applications still remain
rather simple prototypes [11].
A significant barrier towards capturing 3D video content of sports
events is the technical requirements: current practice requires
outside-in tracking via numerous, expensive cameras, preventing
most sports events from streaming 3D content in real-time. Examples
of 3D video applications in the sports domain include the commercial
product FreeD's Replay [1]. Creating volumetric replay requires
high-end camera equipment; for example, streaming a tennis match
in 3D requires 28 cameras with 5K resolution installed around the
tennis court [22]. The equipment cost for 3D modeling with
multi-view cameras to stream 3D sport events scales with the size
of the playground. As an example, 38 ultra-high-definition cameras
are necessary to capture an entire soccer field with outside-in
tracking, as shown in Figure 1. Unfortunately, most current
approaches to modeling 3D sport events still rely on outside-in
capturing of 3D video provided by multiple static cameras around
the sports field.
Figure 1: Outside-in tracking of a soccer stadium with 38 ultra-high-definition cameras.
Figure 2: Inside-out tracking of a race track with one onboard camera mounted on the race car.
Computer vision can support generating 3D videos of sports
events at substantially lower costs by enabling systems to infer 3D
content from the current location of players (e.g. humans, race
cars) in 2D video feeds. Recently, a study demonstrated the
feasibility of converting 2D YouTube videos of historic soccer
matches into 3D videos [18], thereby not only enabling reviewing
old soccer matches in 3D, but also allowing 3D videos to be
generated at low cost. To infer the location of players on a field,
a convolutional neural network is trained on 3D data extracted
from soccer video games to estimate the depth map of each player
in every pose. After localizing the player on the field and
estimating the pose, the trained neural network calculates the
corresponding depth map. Their solution uses a field localization
approach, which only works on fields with dominant visual features,
such as a soccer field with a pre-defined layout (i.e. white lines,
green grass, four corners). Therefore, the approach of [18] is
limited to playgrounds covered by dense static mono cameras around
the field; it does not allow capturing sports content on larger
outdoor areas, e.g. race tracks, and requires the system to be
calibrated for one specific field outline.
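In contrast, inside-out tracking regresses the agent's 6-DoF pose directly from the onboard camera image with a CNN, as in the PoseNet family of models mentioned above. The following is a minimal sketch of such a pose regressor, not the network used in this study; it assumes PyTorch, and the class name `PoseRegressor` and all layer sizes are illustrative choices.

```python
import torch
import torch.nn as nn

class PoseRegressor(nn.Module):
    """PoseNet-style pose regressor (illustrative): a small CNN
    backbone followed by two heads that predict a 3-D position and
    a unit quaternion for orientation."""

    def __init__(self):
        super().__init__()
        # Tiny convolutional backbone; a real system would use a
        # pretrained backbone such as GoogLeNet or ResNet.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global average pooling
        )
        self.fc = nn.Linear(128, 256)
        self.pos_head = nn.Linear(256, 3)  # x, y, z in metres
        self.rot_head = nn.Linear(256, 4)  # quaternion (w, x, y, z)

    def forward(self, img):
        feat = self.backbone(img).flatten(1)
        feat = torch.relu(self.fc(feat))
        pos = self.pos_head(feat)
        quat = self.rot_head(feat)
        # Normalise so the output is a valid rotation quaternion.
        quat = quat / quat.norm(dim=1, keepdim=True)
        return pos, quat

# One onboard-camera frame (batch of 1, RGB, 224x224):
frame = torch.randn(1, 3, 224, 224)
pos, quat = PoseRegressor()(frame)
print(pos.shape, quat.shape)  # torch.Size([1, 3]) torch.Size([1, 4])
```

A network of this form maps each frame to a position and orientation in one forward pass, which is what makes per-frame localization fast enough for a live streaming pipeline; temporal smoothing (e.g. the LSTM variant named in the abstract) can then be layered on top of the per-frame estimates.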
2020 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW)
978-1-7281-6532-5/20/$31.00 ©2020 IEEE
DOI 10.1109/VRW50115.2020.0-223