Semantic Retrieval of Video

Ziyou Xiong 1, Xiang Zhou 2, Qi Tian 3, Yong Rui 4, and Thomas S. Huang 5

Abstract: In this article we review research on three major types of video content: video of meetings, movies and broadcast news, and sports video. We place this work within a general framework of video summarization, browsing, and retrieval, and review the video representation techniques used for each type of content within this framework. Finally, we present the challenges facing the video retrieval research community.

1. INTRODUCTION

Video content can be accessed using either a top-down approach or a bottom-up approach [1, 2, 3, 4]. The top-down approach, i.e., video browsing, is useful when we need to get an "essence" of the content. The bottom-up approach, i.e., video retrieval, is useful when we know exactly what we are looking for in the content, as shown in Fig. 1. When we do not know exactly what we are looking for, human-computer interaction, such as relevance feedback and active learning, can be used to better match the user's intentions and needs with the video content.

Figure 1. Relationship between video retrieval and browsing.

In the following, we give an overview of the research work on three major types of video content: video of meetings, movies and broadcast news, and sports.

1.1 Video of Meetings

Meetings are an important part of everyday life for many workgroups. Often, due to scheduling conflicts or travel constraints, people cannot attend all of their scheduled meetings. In addition, people are often only peripherally interested in a meeting: they want to know what happened during the meeting without actually attending. Being able to browse and skim such meetings could therefore be quite valuable. Initial work on summarization and retrieval of video of meetings has been reported in [33].
1.2 Movies and Broadcast News

Recently, movies and news videos have received great attention from the research community, motivated largely by the interest of movie makers and broadcasters in building large digital archives of their assets, both for reusing archive material in TV programs and for making it available on-line to other companies [5]. Movies and news have a fairly definite structure and do not exhibit a wide variety of edit effects (mainly cuts) or shooting conditions (e.g., illumination). This definite structure is well suited to content analysis and has been exploited for automatic classification, for example, in [6], [7], [8], [9], [10], [11]. All of these systems employ a two-stage scene classification scheme. First, the video stream is parsed and video shots are extracted. Each shot is then classified into content classes such as news report, weather forecast, etc. The general approach to this type of classification relies on the definition of one or more image

1 United Technologies Research Center, East Hartford, CT. Email: xiongz@utrc.utr.com
2 Siemens Corporate Research, Princeton, NJ 08540. Email: xzhou@scr.siemens.com
3 Department of Computer Science, University of Texas at San Antonio, San Antonio, TX. Email: qitian@cs.utsa.edu
4 Microsoft Research, One Microsoft Way, Redmond, WA. Email: yongrui@microsoft.com
5 Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, IL. Email: huang@ifp.uiuc.edu
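The two-stage scheme described in Section 1.2 (parse the stream into shots, then label each shot with a content class) can be sketched as follows. This is a minimal illustration, not the method of any of the cited systems: it assumes grayscale frames given as flat lists of pixel values, uses a simple histogram-difference cut detector for stage one, and a nearest-centroid classifier over average shot histograms for stage two. All function names, the bin count, and the threshold are illustrative choices.

```python
def frame_histogram(frame, bins=8, max_val=256):
    """Normalized grayscale intensity histogram of one frame
    (a flat list of pixel values in [0, max_val))."""
    hist = [0] * bins
    for px in frame:
        hist[min(px * bins // max_val, bins - 1)] += 1
    total = len(frame)
    return [h / total for h in hist]

def detect_shots(frames, threshold=0.5):
    """Stage 1: segment the frame sequence into shots by declaring a cut
    wherever the L1 distance between consecutive histograms exceeds
    the threshold. Returns a list of (start, end) index pairs."""
    boundaries = [0]
    prev = frame_histogram(frames[0])
    for i in range(1, len(frames)):
        cur = frame_histogram(frames[i])
        diff = sum(abs(a - b) for a, b in zip(prev, cur))
        if diff > threshold:
            boundaries.append(i)
        prev = cur
    boundaries.append(len(frames))
    return [(boundaries[i], boundaries[i + 1])
            for i in range(len(boundaries) - 1)]

def classify_shot(frames, shot, class_models):
    """Stage 2: label one shot with the nearest class model, where each
    model is a centroid histogram (e.g. for 'news report' or
    'weather forecast')."""
    start, end = shot
    # Average histogram over the shot's frames as a simple shot signature.
    hists = [frame_histogram(f) for f in frames[start:end]]
    avg = [sum(col) / len(hists) for col in zip(*hists)]
    def dist(model):
        return sum(abs(a - b) for a, b in zip(avg, model))
    return min(class_models, key=lambda label: dist(class_models[label]))
```

In practice the cited systems use far richer shot features (color, motion, text, audio) and trained classifiers rather than fixed centroids; the point here is only the two-stage structure, in which classification operates on shots rather than on raw frames.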