Video Shots Key-Frames Indexing and Retrieval Through Pattern Analysis and Fusion Techniques

Rachid Benmokhtar and Benoit Huet
Institut Eurécom - Département Multimédia
2229, route des crêtes
06904 Sophia-Antipolis - France
(Rachid.Benmokhtar, Benoit.Huet)@eurecom.fr

Sid-Ahmed Berrani and Patrick Lechat
Orange-France Telecom R&D
4, rue du Clos Courtel
35512 Cesson-Sévigné Cedex
(Sidahmed.Berrani, Patrick.Lechat)@orange-ftgroup.com

Abstract— This paper proposes an automatic semantic video content indexing and retrieval system based on fusing various low-level visual and shape descriptors. Features extracted from region and sub-image block segmentation of video shot key-frames are described via an IVSM signature (Image Vector Space Model) in order to obtain a compact and efficient description of the content. Static feature fusion based on averaging and concatenation is introduced to obtain effective signatures. Support Vector Machines (SVMs) and neural networks (NNs) are employed to perform classification. The task of the classifiers is to detect the video semantic content. Then, classifier outputs are fused using a neural network based on evidence theory (NN-ET) in order to provide a decision on the content of each shot. The experimental results are conducted in the framework of a soccer video feature extraction task¹.

Keywords: Feature fusion, classification, classifier fusion, neural network, evidence theory, CBIR.

I. INTRODUCTION

With the development of the internet, multimedia information such as images and videos has become a major source of content on the internet. An efficient image and video retrieval system is highly desired to narrow down the well-known semantic gap between visual features and the richness of human semantics. To respond to the increase in audiovisual information, various methods for indexing, classification and fusion have emerged.
The need to analyse the content has appeared, in order to facilitate understanding and to contribute to better automatic video content indexing and retrieval. The retrieval of complex semantic concepts requires the analysis of many features per modality. The task of combining all these different parameters is far from trivial. The fusion mechanism can take place at different levels of the classification process. Generally, it is applied either on signatures (feature fusion) or on classifier outputs (classifier fusion).

This paper presents our research conducted toward a semantic video content indexing and retrieval system. It aims at tasks such as the high-level feature detection task of TrecVid, but is limited, as far as this paper is concerned, to the application domain of soccer game analysis. It starts with a description of our automatic system architecture.

¹The work presented here is funded by Orange-France Telecom R&D under CRE 46134752.

[Fig. 1. General framework of the application: region and block segmentation of key-frames; feature extraction (HSV, RGB, Gabor and EHD descriptors); k-means clustering and IVSM visual dictionary construction; feature fusion; classification by banks of SVMs and NNs; and classifier fusion by a neural network based on evidence theory (NN-ET) for semantic concept detection.]

We distinguish four steps: feature extraction, feature fusion, classification and fusion. The overall processing chain of our system is presented in Figure 1. The feature extraction step consists in creating a set of low-level descriptors (based on color, texture and shape). The static feature fusion step is achieved based on two distinct approaches: averaging and concatenation.
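The two static feature fusion strategies can be sketched as follows. This is a minimal illustration, assuming each descriptor has already been turned into a fixed-length IVSM signature; the descriptor names and dimensions here are invented for the example and are not taken from the paper.

```python
import numpy as np

# Hypothetical per-descriptor signatures for one key-frame.
# In practice each would be an IVSM vector built from one descriptor
# (e.g. HSV, RGB, Gabor on regions or blocks); sizes are illustrative.
rng = np.random.default_rng(0)
signatures = {
    "hsv_region": rng.random(64),
    "rgb_region": rng.random(64),
    "gabor_region": rng.random(64),
}

def fuse_average(sigs):
    """Average fusion: element-wise mean of equal-length signatures."""
    return np.mean(np.stack(list(sigs.values())), axis=0)

def fuse_concatenate(sigs):
    """Concatenation fusion: stack signatures into one longer vector."""
    return np.concatenate(list(sigs.values()))

avg = fuse_average(signatures)      # same dimension as one signature
cat = fuse_concatenate(signatures)  # dimension = sum of all signatures
```

Averaging keeps the merged signature compact (it requires equal-length inputs), while concatenation preserves each descriptor's information at the cost of a higher-dimensional input to the classifiers.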
Both are described, implemented and evaluated with the objective of obtaining an effective signature for each key-frame. The classification step is used to estimate the video semantic content. Both Support Vector Machines (SVMs) and Neural Networks (NNs) are employed. In the final stage of our system, fusion of classifier outputs is performed thanks to a neural network based on evidence theory (NN-ET).

The experimental results presented in this paper are conducted in the application domain of soccer game videos. The aim is to automatically detect game actions and views (such as center view, left goal, side view, player close-up, etc.) from video analysis. This study reports the efficiency of fusion mechanisms (before and after classification) and shows the improvement provided by our proposed scheme. Finally, we conclude with a summary of the most important results provided by this study along with some possible extensions of this work.
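The NN-ET architecture itself is not detailed in this section, but the evidence-theoretic machinery it builds on can be illustrated with Dempster's rule of combination. The sketch below assumes each classifier's output has already been converted into a mass function over K singleton classes plus an ignorance mass on the whole frame Ω; the numeric masses in the usage example are invented.

```python
import numpy as np

def dempster_combine(m1, m2):
    """Combine two mass functions by Dempster's rule.

    Each mass function is a vector [m(h_1), ..., m(h_K), m(Omega)]:
    mass on K singleton classes plus the ignorance mass on the frame.
    """
    k = len(m1) - 1  # number of singleton classes
    fused = np.zeros(k + 1)
    for i in range(k):
        # h_i results from the intersections h_i∩h_i, h_i∩Omega, Omega∩h_i.
        fused[i] = m1[i] * m2[i] + m1[i] * m2[k] + m1[k] * m2[i]
    fused[k] = m1[k] * m2[k]  # Omega∩Omega keeps full ignorance
    # The remaining mass lands on conflicting pairs h_i∩h_j (i != j);
    # Dempster's rule discards it and renormalises what is left.
    return fused / fused.sum()

# Two classifiers voting on two classes (illustrative masses):
m_svm = np.array([0.6, 0.1, 0.3])
m_nn = np.array([0.5, 0.2, 0.3])
fused = dempster_combine(m_svm, m_nn)  # mass concentrates on class 1
```

Allocating part of each classifier's belief to Ω lets an unreliable classifier abstain softly instead of forcing a class decision, which is the property the evidence-theoretic fusion exploits.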