Semantic Description of a Video using Representative Frames

Ishan Jindal, Electrical Engineering, Indian Institute of Technology Gandhinagar, Email: ijindal@iitgn.ac.in
Shanmuganathan Raman, Electrical Engineering, Indian Institute of Technology Gandhinagar, Email: shanmuga@iitgn.ac.in

Abstract—Analyzing a very long video and semantically describing its contents is a challenging task in computer vision. Existing approaches, such as video shot detection and summarization, address this problem only partially while maintaining temporal coherency. To reduce the effort required to watch an entire video, we introduce a new technique that combines similar content irrespective of its presence at different time instants. In this approach, we automatically identify only the representative frames corresponding to similar scenes captured at different instants of time. We also provide the labels of the objects present in the representative frames, along with a compact representation of the video. We achieve the task of semantic labelling of frames in a unified deep learning framework involving pre-trained features from a convolutional neural network. We show that the proposed approach addresses semantic labelling effectively, as justified by the results obtained for videos of different scenes captured through different modalities.

I. INTRODUCTION

Advancements in internet technologies and the prolific use of mobile cameras have led to the creation of publicly available videos on websites such as YouTube and Vimeo. Around 300 hours of video content are uploaded to YouTube every minute [1]. For video surveillance applications, data is acquired 24×7. Also, for personal records, people capture long videos of wedding ceremonies, children's activities, travelogues, and many other important occasions. Processing these lengthy videos is always very time consuming.
This raises the question of how this time-consuming task of understanding the contents of videos can be made simple. In this work, we propose a novel approach that facilitates the automatic identification of the important contents present in a given video, even when the camera is hand-held and the changes in the scene are drastic. We do not perform registration of the frames.

The main objective of this paper is to develop an algorithm that reduces the time required to understand the important events in a full-length video. To achieve this objective, we first detect the number of important similar events present in the video. This is done by measuring the similarity between the frames of the video: when the similarity measure crosses a predefined threshold, we detect a shot boundary. Counting the number of shots thus gives the number of candidate important events. These events can be clustered to determine the important activities present in the video irrespective of their temporal occurrences. We then label the identified representative frames based on the objects present in the scene. This task is achieved using a convolutional neural network (CNN) architecture pre-trained on labelled images from the ImageNet dataset. The developed method can be used for content-based video retrieval applications involving text inputs [2], for surveillance applications that group different activities [3], and for understanding scene patterns across massive video datasets [4].

The primary contributions of this work are listed below.

1) Group all the frames from a video which are similar and identify representative frames depicting the different activities and scene changes.
2) Semantically describe the scene present in each representative frame through labels of the objects present, using convolutional neural networks (CNN).
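The pipeline sketched above — inter-frame similarity, threshold-based shot detection, and clustering of similar shots into representative frames — can be illustrated as follows. This is a minimal sketch under illustrative assumptions: the normalized intensity histogram descriptor, the L1 distance, the specific threshold values, and the greedy merging of shots are placeholders, not the exact formulation used in the paper.

```python
import numpy as np

def frame_histogram(frame, bins=16):
    """Normalized intensity histogram, used here as a compact frame descriptor."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / max(hist.sum(), 1)

def detect_shots(frames, threshold=0.4):
    """Mark a shot boundary whenever the distance between consecutive
    frame descriptors crosses a predefined threshold."""
    boundaries = [0]
    prev = frame_histogram(frames[0])
    for i in range(1, len(frames)):
        cur = frame_histogram(frames[i])
        # L1 distance between normalized histograms lies in [0, 2].
        if np.abs(cur - prev).sum() > threshold:
            boundaries.append(i)
        prev = cur
    return boundaries

def representative_frames(frames, boundaries, merge_threshold=0.2):
    """Greedily merge shots with similar content (even if temporally far
    apart) and keep one representative frame index per cluster."""
    reps = []
    for b in boundaries:
        hist = frame_histogram(frames[b])
        if all(np.abs(hist - frame_histogram(frames[r])).sum() > merge_threshold
               for r in reps):
            reps.append(b)
    return reps

# Toy usage: a dark scene, a bright scene, then a return to the dark scene.
dark = np.full((8, 8), 10, dtype=np.uint8)
bright = np.full((8, 8), 200, dtype=np.uint8)
frames = [dark] * 5 + [bright] * 5 + [dark] * 5
boundaries = detect_shots(frames)          # shot boundaries at frames 0, 5, 10
reps = representative_frames(frames, boundaries)
```

Note how the third shot (the returning dark scene) is merged with the first, so only two representative frames remain — this is the temporal-independence property the approach targets. Labelling each representative frame with object classes from an ImageNet pre-trained CNN would then be a separate, final step.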
II. RELATED WORK

Different works have been carried out in the recent past for detecting shots in videos [5], [6], [7], [8], [9], [10]. Zhai and Shah introduced weak and strong boundaries in the video and developed a framework based on the Markov chain Monte Carlo (MCMC) technique for scene segmentation [6]. In [7], both visual and textual information are used for detecting shots. This was an attempt to detect stories in videos, and the dataset consisted only of news videos from the CNN and ABC networks. This work was quite complicated, as it first detects objects such as the faces of recurring TV anchors and then detects the stories by recognizing these faces. Other methods detect shots using singular value decomposition [11], adaptive thresholds on inter-frame similarity [12], and a supervised SVM classifier to separate cuts from non-cuts [13]. The authors of [14] combine SVM training with a global-threshold approach by first detecting shots using a threshold and then confirming them with the SVM. [15] analyzes the shot detection problem at length and provides an optimization-based statistical approach for detecting shots. [16] provides a review of various techniques for both shot detection and condensed representation of videos. This review concluded that most automatic methods fail in the extraction of representative frames because there is ambiguity.

This is a pre-print version of a paper accepted to NCVPRIPG 2015, IIT Patna.