Soccer Video Summarization using Deep Learning

Rockson Agyeman, Rafiq Muhammad and Gyu Sang Choi
Department of Information and Communication Engineering
Yeungnam University, Gyeongsan, Republic of Korea
{rockson, rafiq, castchoi}@ynu.ac.kr

Abstract—This paper presents a deep learning approach to summarizing long soccer videos by leveraging the spatiotemporal learning capability of a three-dimensional Convolutional Neural Network (3D-CNN) and a Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN). Our proposed approach involves 1) the step-by-step development of a Residual Network (ResNet) based 3D-CNN that recognizes soccer actions, 2) manually annotating 744 soccer clips from five soccer action classes for training, and 3) training an LSTM network on soccer features extracted by the proposed ResNet-based 3D-CNN. We combine the 3D-CNN and LSTM models to detect soccer highlights. To summarize a soccer match video, we model the video input as a sequential concatenation of video segments, and a segment is included in the summary video only if its relevance is validated. To evaluate the proposed summarization system, 10 soccer videos were summarized and subsequently rated by 48 participants polled from 8 countries using the Mean Opinion Score (MOS) scale. Collectively, the summarized videos received an MOS of 4 out of 5.

Keywords-Soccer; Highlight; Video Summarization.

I. INTRODUCTION

Soccer is one of the most enjoyed sports globally. It is built around the concept of players moving a ball across the field of play with the objective of kicking it into the opponent's goal. Soccer club managers, in particular, assess game strategies and player performance through video analysis. The difficulty, however, is that analysts must watch volumes of recorded video to identify notable events. A logical solution is video summarization, but summarizing videos for analysis with conventional video editing techniques is very time-consuming and daunting.
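The segment-level summarization idea described in the abstract can be sketched as follows. This is a minimal illustration in which `is_highlight` stands in for the trained 3D-CNN and LSTM recognizer; all function names, segment lengths and frame counts are hypothetical and are not the paper's actual implementation.

```python
# Minimal sketch of segment-based summarization: a match video is treated
# as a sequence of fixed-length segments, and only the segments a highlight
# classifier marks as relevant are concatenated into the summary.

def split_into_segments(num_frames, segment_len):
    """Partition frame indices [0, num_frames) into consecutive segments."""
    return [list(range(s, min(s + segment_len, num_frames)))
            for s in range(0, num_frames, segment_len)]

def summarize(num_frames, segment_len, is_highlight):
    """Concatenate the frames of every segment judged relevant."""
    summary = []
    for segment in split_into_segments(num_frames, segment_len):
        if is_highlight(segment):           # stand-in for 3D-CNN + LSTM
            summary.extend(segment)
    return summary

# Toy usage: pretend a goal action spans frames 30-59 of a 100-frame video.
keep_goal = lambda seg: any(30 <= f < 60 for f in seg)
summary_frames = summarize(100, 16, keep_goal)   # frames 16..63 survive
```

Every 16-frame segment overlapping the action is kept whole, so the summary extends slightly beyond the action boundaries; in practice segment length trades off summary precision against classifier context.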
In discussing video summarization, it is safe to assert that it involves recognizing actions of interest (highlights) and tracing how they occur across the entire length of a video. Popular automatic video summarization techniques include keyframe selection [1, 3], object tracking [2], key sub-shot selection [3] and skims [6], to name but a few. These techniques have proven effective at summarizing videos in general, but they provide little flexibility in selecting the desired set of actions to include in a summary video. In this regard, we employ an action (highlight) recognition-based approach as the foundation of our summarization framework. Our contributions are summarized as follows:

1) We propose an improved three-dimensional (3D) action recognition Convolutional Neural Network (CNN) based on Residual Network (ResNet) [7], to be used as a feature extractor for soccer clips.
2) We collect and annotate 744 soccer clips in five action classes (centerline, corner-kick, free-kick, goal action and throw-in) to train our framework to recognize soccer actions.
3) We train a Long Short-Term Memory (LSTM) network on soccer features extracted by the proposed 3D-CNN, and use the combined 3D-CNN and LSTM network as the highlight recognition framework.
4) We implement a basic but effective method to produce summarized videos based on the 3D-CNN and LSTM highlight recognition framework, and we assess the summarization technique using Mean Opinion Scores collected from 48 soccer enthusiasts.

II. RELATED WORKS

Unlike unsupervised automatic video summarization techniques, which use handcrafted algorithms to concatenate key video frames into summarized video content [1, 2], recent techniques adopt a supervised domain-knowledge learning approach to summarizing long videos [4].
While it is argued that this approach limits a framework's ability to generalize to other domains, it presents the advantage of being able to dictate which actions should be included in a summary video. In the soccer domain, for example, generating a summary video from specific action classes, such as all goals scored in a match, is more useful to an analyst than generalized content that is merely a shorter version of the original. Action recognition techniques can be broadly categorized as 1) techniques that rely on handcrafted feature extractors, such as Bag of Words (BoW) [20], and 2) techniques that use deep learning networks. Handcrafted action representation techniques extract salient features from a sequence of image frames to form feature descriptors, and classification is performed on the extracted descriptors by training a generic classifier such as a Support Vector Machine (SVM) [19]. One of the early works on soccer highlight detection using this method is [21]. Deep learning techniques, on the other hand, learn feature representations automatically. One of the more recent remarkable deep learning techniques capable of human action discrimination is found in [8]. Using the new Kinetics Human Action Video dataset to pre-train existing benchmark

2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), 978-1-7281-1198-8/19/$31.00 ©2019 IEEE, DOI 10.1109/MIPR.2019.00055
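As a concrete illustration of the handcrafted pipeline described above, the sketch below shows the Bag-of-Words encoding step: local descriptors from a clip are vector-quantized against a codebook and pooled into a normalized histogram, which a generic classifier such as an SVM would then consume. The codebook, descriptors and all dimensions here are synthetic placeholders, not taken from [20] or [21].

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """descriptors: (n, d), codebook: (k, d) -> L1-normalized (k,) histogram."""
    # Squared Euclidean distance from every descriptor to every codeword.
    dists = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = dists.argmin(axis=1)             # nearest codeword per descriptor
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()                 # normalize so clip length cancels

# Synthetic example: 8 visual words, 200 local descriptors of dimension 16.
rng = np.random.default_rng(1)
codebook = rng.standard_normal((8, 16))
descriptors = rng.standard_normal((200, 16))
h = bow_histogram(descriptors, codebook)     # fixed-length clip representation
```

The fixed-length histogram is what makes variable-length clips comparable by a generic classifier; deep learning approaches replace this hand-designed encoding with features learned end to end.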