Video Shots Key-Frames Indexing and Retrieval Through Pattern Analysis and Fusion Techniques

Rachid Benmokhtar and Benoit Huet
Institut Eurécom - Département Multimédia
2229, route des crêtes
06904 Sophia-Antipolis - France
(Rachid.Benmokhtar, Benoit.Huet)@eurecom.fr

Sid-Ahmed Berrani and Patrick Lechat
Orange-France Telecom R&D
4, rue du Clos Courtel
35512 Cesson-Sévigné Cedex
(Sidahmed.Berrani, Patrick.Lechat)@orange-ftgroup.com

Abstract— This paper proposes an automatic semantic video content indexing and retrieval system based on fusing various low-level visual and shape descriptors. Features extracted from region and sub-image block segmentation of video shot key-frames are described via an IVSM signature (Image Vector Space Model) in order to obtain a compact and efficient description of the content. Static feature fusion based on averaging and concatenation is introduced to obtain effective signatures. Support Vector Machines (SVMs) and neural networks (NNs) are employed to perform classification. The task of the classifiers is to detect the video semantic content. Then, classifier outputs are fused using a neural network based on evidence theory (NN-ET) in order to provide a decision on the content of each shot. The experimental results are conducted in the framework of a soccer video feature extraction task¹.

Keywords: Feature fusion, classification, classifier fusion, neural network, evidence theory, CBIR.

I. INTRODUCTION

With the development of the internet, multimedia information such as images and videos has become a major source of content on the internet. An efficient image and video retrieval system is highly desired to narrow down the well-known semantic gap between visual features and the richness of human semantics. To respond to the increase in audiovisual information, various methods for indexing, classification and fusion have emerged.
The need to analyse the content has appeared, in order to facilitate understanding and to contribute to better automatic video content indexing and retrieval. The retrieval of complex semantic concepts requires the analysis of many features per modality. The task of combining all these different parameters is far from trivial. The fusion mechanism can take place at different levels of the classification process. Generally, it is applied either on signatures (feature fusion) or on classifier outputs (classifier fusion).

This paper presents our research conducted toward a semantic video content indexing and retrieval system. It aims at tasks such as the high-level feature detection task of TrecVid, but is limited, as far as this paper is concerned, to the application domain of soccer game analysis. It starts with a description of our automatic system architecture.

¹The work presented here is funded by Orange-France Telecom R&D under CRE 46134752.

[Fig. 1. General framework of the application: region and block segmentation of key-frames; feature extraction (HSV, RGB, Gabor and EHD descriptors); k-means clustering and IVSM visual dictionary construction; feature fusion; classification by banks of SVMs and NNs; and classifier fusion by a neural network based on evidence theory (NN-ET) for semantic concept detection.]

We distinguish four steps: feature extraction, feature fusion, classification and fusion. The overall processing chain of our system is presented in Figure 1. The feature extraction step consists in creating a set of low-level descriptors (based on color, texture and shape). The static feature fusion step is achieved based on two distinct approaches: averaging and concatenation.
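The two static feature fusion strategies can be sketched as follows. This is a minimal illustration, assuming each descriptor has already been turned into a fixed-length IVSM signature; the descriptor names and dimensions here are invented for the example and are not taken from the paper.

```python
import numpy as np

# Hypothetical per-descriptor signatures for one key-frame.
# In practice each would be an IVSM vector built from one descriptor
# (e.g. HSV, RGB, Gabor on regions or blocks); sizes are illustrative.
rng = np.random.default_rng(0)
signatures = {
    "hsv_region": rng.random(64),
    "rgb_region": rng.random(64),
    "gabor_region": rng.random(64),
}

def fuse_average(sigs):
    """Average fusion: element-wise mean of equal-length signatures."""
    return np.mean(np.stack(list(sigs.values())), axis=0)

def fuse_concatenate(sigs):
    """Concatenation fusion: stack signatures into one longer vector."""
    return np.concatenate(list(sigs.values()))

avg = fuse_average(signatures)      # same dimension as one signature
cat = fuse_concatenate(signatures)  # dimension = sum of all signatures
```

Averaging keeps the merged signature compact (it requires equal-length inputs), while concatenation preserves each descriptor's information at the cost of a higher-dimensional input to the classifiers.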
Both are described, implemented and evaluated with the objective of obtaining an effective signature for each key-frame. The classification step is used to estimate the video semantic content. Both Support Vector Machines (SVMs) and Neural Networks (NNs) are employed. In the final stage of our system, fusion of classifier outputs is performed thanks to a neural network based on evidence theory (NN-ET).

The experimental results presented in this paper are conducted in the application domain of soccer game videos. The aim is to automatically detect game actions and views (such as center view, left goal, side view, player close-up, etc.) from video analysis. This study reports the efficiency of fusion mechanisms (before and after classification) and shows the improvement provided by our proposed scheme. Finally, we conclude with a summary of the most important results provided by this study along with some possible extensions of this work.
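The NN-ET architecture itself is not detailed in this section, but the evidence-theoretic machinery it builds on can be illustrated with Dempster's rule of combination. The sketch below assumes each classifier's output has already been converted into a mass function over K singleton classes plus an ignorance mass on the whole frame Ω; the numeric masses in the usage example are invented.

```python
import numpy as np

def dempster_combine(m1, m2):
    """Combine two mass functions by Dempster's rule.

    Each mass function is a vector [m(h_1), ..., m(h_K), m(Omega)]:
    mass on K singleton classes plus the ignorance mass on the frame.
    """
    k = len(m1) - 1  # number of singleton classes
    fused = np.zeros(k + 1)
    for i in range(k):
        # h_i results from the intersections h_i∩h_i, h_i∩Omega, Omega∩h_i.
        fused[i] = m1[i] * m2[i] + m1[i] * m2[k] + m1[k] * m2[i]
    fused[k] = m1[k] * m2[k]  # Omega∩Omega keeps full ignorance
    # The remaining mass lands on conflicting pairs h_i∩h_j (i != j);
    # Dempster's rule discards it and renormalises what is left.
    return fused / fused.sum()

# Two classifiers voting on two classes (illustrative masses):
m_svm = np.array([0.6, 0.1, 0.3])
m_nn = np.array([0.5, 0.2, 0.3])
fused = dempster_combine(m_svm, m_nn)  # mass concentrates on class 1
```

Allocating part of each classifier's belief to Ω lets an unreliable classifier abstain softly instead of forcing a class decision, which is the property the evidence-theoretic fusion exploits.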