A Static Video Summarization Approach With Automatic Shot Detection Using Color Histograms

E. J. Y. Cayllahua-Cahuina, G. Câmara-Chávez, D. Menotti
UFOP - Federal University of Ouro Preto
Computing Department
Ouro Preto, MG, Brazil
Email: {ecayllahua1, gcamarac, menottid}@gmail.com

Abstract—Shot detection has been widely used in video summarization for video analysis, indexing, and browsing. In this paper, we present an approach for static video summarization that uses histogram information for automatic shot detection. Principal component analysis (PCA) is used to reduce the dimensionality of the feature vector. We propose the use of the Fuzzy-ART and Fuzzy C-Means algorithms to automatically detect the number of clusters in the video and, consequently, to extract the shots from the original video. The process is entirely automatic, and no a priori human interaction is needed. The storyboards produced by our model are compared with the ones presented by the Open Video Project.

Index Terms—video summarization, shot detection, keyframe extraction, fuzzy clustering, histograms.

I. INTRODUCTION

The volume of multimedia information such as text, audio, still images, animation, and video grows every day, accumulating into very large collections of data. Processing such a volume of data manually would be arduous and, at a certain scale, impossible. Video is a perfect example of multimedia information: the amount of video grows exponentially, with enormous quantities uploaded to the internet each day, in addition to the TV content produced daily and the hours of footage generated by security cameras. It is therefore necessary to develop a model to manage all this information. Video summarization aims to give the user a synthetic and useful visual summary of a video sequence; thus, a video summary is a short version of an entire video sequence.
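As a rough illustration of the dimensionality-reduction step mentioned in the abstract, the sketch below applies PCA (via SVD) to per-frame color-histogram feature vectors. The bin count, frame count, target dimension, and the synthetic data are assumptions for the example, not values from the paper:

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)  # center each feature
    # SVD of the centered data; rows of Vt are the principal directions
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

# Hypothetical input: one 256-bin color histogram per frame (120 frames).
rng = np.random.default_rng(0)
hists = rng.random((120, 256))
hists /= hists.sum(axis=1, keepdims=True)  # L1-normalize each histogram

reduced = pca_reduce(hists, 10)
print(reduced.shape)  # (120, 10)
```

Reducing each histogram from 256 to a handful of dimensions in this way makes the subsequent clustering stage cheaper and less noise-sensitive.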
A video summary can be represented in two fashions: a static video storyboard and a dynamic video skimming. Dynamic video skimming consists of selecting the most relevant short dynamic portions of audio and video to generate the summary. Static video storyboarding, on the other hand, selects the most relevant frames (keyframes) of a video sequence and generates the corresponding summary. In either case, the key step is to recognize these relevant frames or portions of video. The models in the literature differ in what they consider relevant and in how they extract these relevant frames.

A raw video consists of a sequence of video shots. A shot is defined as an image sequence that presents continuous action, captured in a single operation of a single camera, whose visual content can be represented by keyframes. To extract the important keyframes from a video, we first need to segment it, usually into shots, and then determine the most representative frame among those that compose each detected shot.

In this paper, we propose the use of the Fuzzy-ART [1] algorithm to automatically find the possible number of shots, and we then use the Fuzzy C-Means [2] algorithm to discover and extract the keyframes from the detected shots. Our approach is based on [3], but our modification gives us the main benefit that no previous human interaction is needed: the method operates in an unsupervised way and still provides satisfactory summaries. Moreover, the proposed model is computationally inexpensive compared to the original and to other models from the literature.

The remainder of this paper is organized as follows. Section II provides a literature overview. Section III presents the proposed model in detail. In Section IV, we describe the tests and discuss the results. Finally, in Section V, we present our conclusions and future work.
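The keyframe-extraction step described above can be sketched with a plain fuzzy c-means implementation. This is a minimal illustration over hypothetical per-frame feature vectors, not the authors' implementation; the feature dimension, the cluster count, and the fuzzifier m = 2 are assumptions for the example:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    """Plain fuzzy c-means: returns (cluster centers, membership matrix U)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)  # memberships of each point sum to 1
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]  # weighted means
        d = np.linalg.norm(X[:, None] - centers[None], axis=2) + 1e-10
        U = 1.0 / (d ** (2.0 / (m - 1.0)))  # closer center => higher membership
        U /= U.sum(axis=1, keepdims=True)
    return centers, U

# Hypothetical per-frame features forming two well-separated "shots".
rng = np.random.default_rng(1)
frames = np.vstack([rng.normal(0, 0.1, (30, 8)),
                    rng.normal(5, 0.1, (30, 8))])

centers, U = fuzzy_c_means(frames, c=2)
print(U.shape)  # (60, 2)

# One keyframe per shot: the frame with the highest membership in each cluster.
keyframes = U.argmax(axis=0)
print(keyframes)
```

In a setup like this, the number of clusters c would come from the Fuzzy-ART stage rather than being fixed by hand, and the frame nearest each cluster center (highest membership) serves as that shot's keyframe.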
II. RELATED WORK

To generate a correct and complete summarization of a given video, the employed model would have to achieve an optimal understanding of the video's semantic content. However, automatic understanding of the semantic content of videos is a very complex task and is still far beyond the intelligence of today's computing systems, despite the significant advances in computer vision, image processing, pattern recognition, and machine learning algorithms.

To capture the semantics of a video, some approaches [4], [5], [3] process a single feature of the video content, such as the color histogram or motion. However, a video is a very complex collection of data, and it is quite difficult to effectively discriminate its most meaningful parts using a single feature. To overcome this problem, the most recent works combine different features. For example, in [6], [7], [8], audio data is used in addition to visual information. Textual information which is