Learning Deep C3D Features for Soccer Video Event Detection

Muhammad Zeeshan Khan, Summra Saleem, Muhammad A. Hassan, Muhammad Usman Ghanni Khan
Al-Khwarizmi Institute of Computer Science, UET Lahore; Computer Science Department, UET Lahore
zeeshan.khan@kics.edu.pk

Abstract—Soccer video event identification has been an interesting task for the research community over the past few decades. Numerous machine learning techniques and 2-D convolutional networks (C2D) have been applied to this problem, but 3-D convolution (C3D) has not yet been used for this task. In the proposed work, we develop a deep convolutional network that takes advantage of C3D to fully exploit spatio-temporal relations and highlight distinct video events. We first detect soccer video event boundaries using pixel differencing and the edge change ratio. Semantic features of the segmented frames are then extracted and passed to a CNN that maps them to four soccer event categories: Corner, Shoot, Goal Attempt, and Penalty Kick. Because no effective and suitable dataset is available, we categorized soccer videos into these four classes and built a soccer video dataset for training the CNN. Evaluation on soccer match clips produced highly accurate results.

Index Terms—soccer events, 3D CNN, scene boundary detection, C3D (Convolution 3D)

I. INTRODUCTION

Nowadays, one of the key tasks in image processing and computer vision is to detect events and generate descriptions from images as well as videos. Over the past few decades, work in this field has expanded rapidly. Activity recognition is among the most promising of these tasks because of its significant role in applications that operate in real-time environments. These applications include intelligent video surveillance, customer-attribute and shopping-behaviour analysis, analysis of unusual events, human-computer interaction, telehealth, biometrics, video indexing, and virtual coaching.
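The event-boundary step mentioned in the abstract (pixel differencing combined with the edge change ratio) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the gradient-magnitude edge map stands in for the Canny edges typically used with the edge change ratio, and the thresholds (`pix_thresh`, `ecr_thresh`, `dilate`) are illustrative assumptions.

```python
import numpy as np

def edge_map(gray, thresh=30.0):
    """Binary edge map from gradient magnitude (a simple stand-in
    for the Canny edge map usually used with the edge change ratio)."""
    gy, gx = np.gradient(gray.astype(np.float64))
    return np.hypot(gx, gy) > thresh

def edge_change_ratio(prev_gray, curr_gray, dilate=1):
    """ECR between consecutive frames: the larger of the fraction of
    edge pixels entering the current frame and leaving the previous one."""
    e_prev, e_curr = edge_map(prev_gray), edge_map(curr_gray)

    def dilated(e):
        # Dilate the edge map so small object motion does not count as change.
        out = e.copy()
        for shift in (-dilate, dilate):
            out |= np.roll(e, shift, axis=0) | np.roll(e, shift, axis=1)
        return out

    entering = np.sum(e_curr & ~dilated(e_prev))
    exiting = np.sum(e_prev & ~dilated(e_curr))
    n_prev, n_curr = max(e_prev.sum(), 1), max(e_curr.sum(), 1)
    return max(entering / n_curr, exiting / n_prev)

def is_boundary(prev_gray, curr_gray, pix_thresh=25.0, ecr_thresh=0.5):
    """Flag a scene boundary when both pixel differencing and the
    edge change ratio indicate a large change between frames."""
    pix_diff = np.mean(np.abs(curr_gray.astype(np.float64)
                              - prev_gray.astype(np.float64)))
    return pix_diff > pix_thresh and \
        edge_change_ratio(prev_gray, curr_gray) > ecr_thresh
```

Requiring both cues to fire makes the detector robust to flashes (which change pixel values but not edge structure much) and to gradual lighting drift.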
Although substantial research has addressed activity recognition, the unpredictable and uncertain behaviour of human beings makes the automatic detection of human activity a persistently challenging task. Due to intra-class variation, related image problems such as action detection and event detection and recognition are also difficult. For this reason, most current approaches make predictions under assumptions of small scale and viewpoint change.

Football is among the most fascinating and widely followed sports in the world. Annotating and evaluating sports video attracts many researchers because of its appeal to the public as well as to broadcasters, in comparison to conventional videos. Detecting and recognizing specific parts of a particular sport is extremely interesting. With the progression of deep learning and computational resources, accuracy on image-related problems has increased considerably. We distinguish explicit football match events using pattern recognition techniques.

Computer vision approaches to activity recognition are based on two steps: first, learning deep features from the raw frames of the videos, and second, learning a classifier based on these features. Deep learning models construct high-level features from low-level ones, so to make the results more accurate we take advantage of a deep neural network. Our proposed system takes a video as input and breaks it into chunks on the basis of scene change detection. Each chunk is then passed to our C3D (Convolution 3-dimensional) architecture, which labels it with one of our predefined categories. The framework of the proposed system is depicted in Figure 1.

II. RELATED WORK

A considerable amount of work on activity recognition has been presented in the literature.
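The advantage C3D has over 2-D convolution — filters that aggregate evidence across time as well as space — can be illustrated with a minimal single-channel 3-D convolution. This is a didactic sketch, not the proposed architecture: the 3×3×3 kernel size is a common C3D choice assumed here, and the hand-built temporal-difference filter merely demonstrates a response that no purely spatial filter can produce.

```python
import numpy as np

def conv3d(clip, kernel):
    """Valid-mode 3-D convolution of a single-channel clip (T, H, W)
    with a kernel (kt, kh, kw): each output value pools a small
    spatio-temporal volume, not just a spatial patch."""
    T, H, W = clip.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = np.sum(clip[t:t+kt, i:i+kh, j:j+kw] * kernel)
    return out

# A 3x3x3 temporal-difference filter: responds to change between
# frames two steps apart, which a 2-D convolution cannot see.
kernel = np.zeros((3, 3, 3))
kernel[0] = -1.0 / 9.0   # subtract the earliest frame's patch
kernel[2] = 1.0 / 9.0    # add the latest frame's patch

static_clip = np.ones((8, 16, 16))   # no motion: filter output is zero
moving_clip = np.ones((8, 16, 16))
moving_clip[4:] = 2.0                # appearance changes at frame 4
```

On `static_clip` every output is zero, while on `moving_clip` the filter fires exactly around the frame where the change occurs; learned C3D filters exploit the same temporal axis to separate, say, a shoot from a corner.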
Most work on activity recognition is based on the concept of pattern recognition, and various approaches have been used for the automatic detection of human actions. Conventional computer vision methodologies rely on background subtraction, edge detection, and feature extraction for analyzing human actions. Most algorithms for video action classification are built on shallow, high-dimensional encodings of local spatio-temporal features. Laptev et al. [1] implemented a methodology for learning realistic human actions from movies by detecting sparse spatio-temporal interest points and then describing them with local spatio-temporal features. These features are encoded as bag-of-visual-words representations and passed to an SVM classifier for prediction. Later, H. Wang et al. [2] observed that dense sampling of spatio-temporal features performs better than sparse interest points, whereas R. Girdhar et al. [3] trained a discriminative model based on temporal segmentation of the video sequence together with an appearance model for each motion segment. Neural networks have also been applied to activity detection using convolutional architectures. The model proposed by A. Karpathy et al. [4] uses two streams of convolutional neural networks; the first stream receives down-sampled frames at half the original spatial resolution and