Soccer Video Summarization using Deep Learning
Rockson Agyeman, Rafiq Muhammad and Gyu Sang Choi
Department of Information and Communication Engineering
Yeungnam University
Gyeongsan, Republic of Korea
{rockson, rafiq, castchoi}@ynu.ac.kr
Abstract—This paper presents a deep learning approach to
summarizing long soccer videos by leveraging the
spatiotemporal learning capability of three-dimensional
Convolutional Neural Network (3D-CNN) and Long Short-
Term Memory (LSTM) – Recurrent Neural Network (RNN).
Our proposed approach involves 1) a step-by-step development
of a Residual Network (ResNet) based 3D-CNN that recognizes
soccer actions, 2) manually annotating 744 soccer clips from five
soccer action classes for training, and 3) training an LSTM
network on soccer features extracted by the proposed ResNet
based 3D-CNN. We combine the 3D-CNN and LSTM models to
detect soccer highlights. To summarize a soccer match video, we
model the input video as a sequential concatenation of video
segments, each of which is included in the summary according to
its validated relevance. To evaluate the proposed
summarization system, we summarized 10 soccer videos, which were
subsequently rated by 48 participants from 8
countries on the Mean Opinion Score (MOS) scale.
Collectively, the summarized videos received an average MOS of 4 out of 5.
Keywords-Soccer; Highlight; Video Summarization.
I. INTRODUCTION
Soccer is one of the most enjoyed sports globally. It is built
around the concept of players moving a ball across the field of
play with the objective of kicking it into the opponent's
goal. Soccer club managers, in particular, assess game
strategies and player performance through video analysis. The
difficulty, however, is that analysts must watch volumes of
recorded videos to identify notable events. While a logical
solution is video summarization, using conventional video
editing techniques to summarize videos for analysis is a
time-consuming and daunting task.
Video summarization fundamentally involves recognizing
actions of interest (highlights) and tracking how they unfold
over the entire length of a video. Popular automatic video summarization
techniques include keyframe selection [1,3], object tracking
[2], key sub-shot selection [3] and skims [6], to name but a
few. These techniques have proven effective at summarizing
generic videos, but they offer little flexibility in selecting
which actions should be included in a summary
video. In this regard, we employ an action (highlight)
recognition-based approach as the foundation for our
summarization framework. Our contributions are summarized
as follows:
• We propose an improved three-dimensional (3D)
action recognition Convolutional Neural Network
(CNN) based on Residual Network (ResNet) [7] to be
used as a feature extractor for soccer clips.
• We collect and annotate 744 soccer clips in five action
classes: centerline, corner-kick, free-kick, goal action,
and throw-in, to train our framework to recognize
soccer actions.
• We train a Long Short-Term Memory (LSTM)
network on soccer features extracted by the proposed
3D-CNN. We use the combined 3D-CNN and LSTM
network as the highlight recognition framework.
• We implement a basic but effective method to produce
summarized videos based on the 3D-CNN and LSTM
highlight recognition framework, and we assess the
summarization technique using the Mean Opinion
Scores collected from 48 soccer enthusiasts.
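The combined pipeline can be illustrated with a minimal PyTorch sketch. The layer sizes, clip dimensions, and class count are placeholders (the paper's actual model is a deeper ResNet-based 3D-CNN trained on the annotated clips): the 3D-CNN produces one feature vector per clip, and the LSTM aggregates the clip sequence before classification.

```python
import torch
import torch.nn as nn

class HighlightRecognizer(nn.Module):
    """Toy sketch of a 3D-CNN feature extractor feeding an LSTM."""
    def __init__(self, num_classes=5, feat_dim=64, hidden=128):
        super().__init__()
        # 3D convolution learns spatiotemporal features from each clip
        self.cnn3d = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # -> (B*T, 16, 1, 1, 1)
            nn.Flatten(),             # -> (B*T, 16)
            nn.Linear(16, feat_dim),
        )
        # LSTM aggregates clip-level features across the segment sequence
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, segments):
        # segments: (B, T, C, D, H, W), i.e. T clips per video
        b, t = segments.shape[:2]
        feats = self.cnn3d(segments.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])  # classify from the last timestep

model = HighlightRecognizer()
clip_batch = torch.randn(2, 4, 3, 8, 32, 32)  # 2 videos, 4 clips each
logits = model(clip_batch)
print(logits.shape)  # torch.Size([2, 5]): one score per action class
```

The same structure scales to a real ResNet-based 3D backbone by swapping `cnn3d` for a deeper network; the LSTM interface is unchanged as long as the per-clip feature dimension matches.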
II. RELATED WORK
Unlike unsupervised automatic video summarization
techniques that involve the use of handcrafted algorithms to
concatenate key video frames to form summarized video
contents [1, 2], recent techniques adopt a supervised domain
knowledge learning approach to summarizing long videos [4].
While it is argued that this approach limits a framework’s
ability to generalize to other domains, it offers the
advantage of dictating which actions should be
included in a summary video. In the soccer domain, for example,
generating a summary video based on specific action
classes, such as all goals scored in a match, is more valuable
to an analyst than generalized video content that is merely a
shorter version of the original.
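Class-conditioned summarization of this kind reduces to filtering segments by their recognized action. The sketch below is purely illustrative: `predict` and the labeled segments stand in for the recognition framework's output.

```python
def summarize(segments, predict, wanted=("goal", "free-kick")):
    """Keep only segments whose recognized action is in the wanted set
    (hypothetical sketch of class-conditioned summarization)."""
    return [seg for seg in segments if predict(seg) in wanted]

# Toy predictor: each segment carries its label for illustration.
clips = [("s1", "throw-in"), ("s2", "goal"),
         ("s3", "corner-kick"), ("s4", "goal")]
summary = summarize(clips, predict=lambda s: s[1])
print([s[0] for s in summary])  # ['s2', 's4']
```

An analyst changes what the summary contains simply by changing `wanted`, which is exactly the flexibility that generic keyframe-based methods lack.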
Action recognition techniques can also be broadly
categorized into 1) techniques that rely on handcrafted feature
extractors, such as Bag of Words (BoW) [20], and 2)
techniques that use deep learning networks.
Handcrafted action representation techniques involve the
extraction of salient features from a sequence of image frames
to form feature descriptors, while classification is performed
on the extracted feature descriptors by training a generic
classifier such as Support Vector Machine (SVM) [19]. One
of the early works conducted on soccer highlight detection
using this method is [21].
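A BoW-plus-SVM pipeline of this kind can be sketched with scikit-learn. The descriptors below are random stand-ins for handcrafted local features (e.g. HOG- or SIFT-like vectors), and the vocabulary size and classifier settings are illustrative: cluster all descriptors into a visual vocabulary, histogram each clip over the vocabulary, then train an SVM on the histograms.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-ins for handcrafted descriptors: 30 per clip, 16-dim,
# with the class label shifting the descriptor mean.
clips = [rng.normal(loc=y, size=(30, 16)) for y in (0, 1) * 20]
labels = np.array([0, 1] * 20)

# 1) Build a visual vocabulary by clustering all descriptors
vocab = KMeans(n_clusters=8, n_init=10, random_state=0)
vocab.fit(np.vstack(clips))

# 2) Represent each clip as a normalized histogram over visual words
def bow(descriptors):
    words = vocab.predict(descriptors)
    return np.bincount(words, minlength=8) / len(words)

X = np.array([bow(c) for c in clips])

# 3) Train a generic classifier (SVM) on the BoW vectors
clf = SVC().fit(X[:30], labels[:30])
acc = clf.score(X[30:], labels[30:])
print(acc)
```

Unlike the deep approaches discussed next, every stage here (descriptor design, vocabulary size, kernel choice) is hand-engineered rather than learned end to end.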
Deep learning techniques, on the other hand, learn feature
representations automatically. One of the more recent
remarkable deep learning techniques capable of discriminating
human actions is found in [8]. Using the new Kinetics Human
Action Video dataset to pre-train existing benchmark
2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR)
978-1-7281-1198-8/19/$31.00 ©2019 IEEE
DOI 10.1109/MIPR.2019.00055