An Open Framework for Video Content Analysis

Chia-Wei Liao, Kai-Hsuan Chan, Bin-Yi Cheng, Chi-Hung Tsai, Wen-Tsung Chang, and Yu-Ling Chuang
Advanced Research Institute, Institute for Information Industry, Taipei, Taiwan, R.O.C.
E-mail: cliao@iii.org.tw, kaihsuan@gmail.com, binyi@iii.org.tw, brick@iii.org.tw, wtchang@iii.org.tw, ilmachuang@iii.org.tw

Abstract—In the past few years, the amount of internet video has grown rapidly, and it has become a major market. Efficient video indexing and retrieval is therefore an important research and system-design issue, and reliable extraction of metadata from video as indexes is a major step toward efficient video management. There are numerous video types, and in principle anyone can define his/her own. The nature of videos can differ so much that we may end up needing, for each video type, a dedicated video analysis module, which is in itself nontrivial to implement. We believe an open video analysis framework helps when one needs to process various types of videos. In this paper, we propose an open video analysis framework in which video analysis modules are developed and deployed as plug-ins. In addition to plug-in management, the framework provides a runtime environment with standard libraries and proprietary rule-based automaton modules to facilitate plug-in development. A prototype has been implemented and validated with several experimental plug-ins.

I. INTRODUCTION AND RELATED WORK

With the development of video infrastructure such as faster networks, cheaper and larger storage, and the popularity of digital cameras, the number of internet videos has grown rapidly. For example, on average 60 hours of video are uploaded to YouTube per minute, and over 4 billion video clips are viewed per day [1]. Amid this tremendous growth of media data, videos can be searched more efficiently if they are accurately tagged or annotated. Manual annotation is feasible when there are only a small number of videos.
We need an automatic annotation/tagging mechanism when facing a sea of videos. Automatic annotation is a challenging problem [2] due to the difficulty of classification: video analysis entails computer vision technologies together with related domain knowledge (e.g., of sports videos).

There are numerous surveys of integrated automatic annotation. In Reference [3], video abstracts are used for multimedia archives; videos are segmented and analyzed to extract salient features (e.g., faces, dialog, and text), which help assemble video clips of interest into an abstract. References [4][5] provide a comprehensive overview of existing video abstraction methods, problem formulations, result evaluation, and a systematic classification of different approaches.

In addition to the integrated surveys, much research has been done on the annotation of specific video types, and some researchers focus on unified frameworks. By recognizing common audio events [6][7][8], events can be detected in sports videos (such as baseball, golf, soccer, swimming, and races). The use of low-level visual features (such as color histograms and oriented gradients) to detect highlights is proposed in [9][10][11]. References [12][13] suggest content-based retrieval to detect the court, the players, and the ball in a tennis game. Frameworks that combine visual and audio features to extract multimedia highlights are introduced in [14][15][16]. Linguistic annotation tools, such as Transana, produce descriptive or analytic notations from raw data such as video [17] and are used by professional researchers to analyze digital video or audio data. Most linguistic annotation tools conduct analysis manually and would benefit considerably from automatic annotation if it were available.

In this paper, we propose an open, automaton-based video analysis framework in which video analysis modules can be dynamically managed (e.g., added and deleted).
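To make the rule-based automaton idea concrete, the following sketch models a highlight-detection rule as a small finite-state machine driven by per-shot labels. All names here (the `RuleAutomaton` class, the tennis "point" rule, and the observation symbols) are hypothetical illustrations, not the paper's proprietary automaton modules.

```python
class RuleAutomaton:
    """Finite-state machine that fires an event when a rule,
    expressed as a sequence of state transitions, is completed.
    (Illustrative sketch; not the paper's actual module.)"""

    def __init__(self, start, transitions, accepting):
        self.start = start                # initial state name
        self.transitions = transitions    # {(state, symbol): next_state}
        self.accepting = accepting        # states signaling a detected event
        self.state = start

    def feed(self, symbol):
        """Consume one observation (e.g., a per-shot label); an
        unmatched symbol resets the machine to its start state.
        Returns the event (state name) if an accepting state is reached."""
        self.state = self.transitions.get((self.state, symbol), self.start)
        return self.state if self.state in self.accepting else None


# Hypothetical rule: a tennis "point" highlight is a serve, followed by
# one or more rally shots, followed by crowd cheering.
point_rule = RuleAutomaton(
    start="idle",
    transitions={
        ("idle", "serve"): "served",
        ("served", "rally"): "rallying",
        ("rallying", "rally"): "rallying",
        ("rallying", "cheer"): "point_detected",
    },
    accepting={"point_detected"},
)

events = [point_rule.feed(s) for s in ["serve", "rally", "rally", "cheer"]]
print(events)  # only the final observation completes the rule
```

Expressing detection logic as declarative transition tables, rather than hand-written control flow, is what lets such rules be supplied by plug-ins rather than hard-coded into the framework.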
Moreover, our framework lowers the barrier to developing video analysis modules. We first give a high-level overview of our framework and then describe its major components. Next, an experiment based on a tennis plug-in (analyzer) validates our system design. Finally, conclusions and future work are presented.

II. FRAMEWORK

Video metadata is a high-level representation of the original video. Indexed properly with the right metadata, videos can be efficiently retrieved and searched in a database. Extracting reliable and useful metadata from a video involves video understanding and content analysis, which is always a tall order in terms of both algorithms and system design (and implementation). With a framework providing run-time libraries and higher-level control commands, developers can focus on algorithm development instead of detailed and tedious system design and implementation (such as flow control and caching for optimization).

There are virtually an unlimited number of video types, and oftentimes video type definition and classification are subjective. Each video type could need a dedicated content analyzer, and each analyzer is itself a challenge in both algorithm and system development. It is impractical to expect a single system to handle most video types. Our proposed framework comes with a runtime environment for video analysis and a manager that manages videos and their analyzers. In this framework, video analyzers can be added or removed freely as plug-ins. Video applications access plug-ins through the framework. The
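The manager/plug-in relationship can be sketched as follows. This is a minimal illustration under assumed interfaces (the `VideoAnalyzer` base class, `AnalyzerManager`, and the toy `TennisAnalyzer` are hypothetical names); the paper does not publish the framework's actual API.

```python
# Sketch: analyzers register with a framework-side manager, and
# applications reach them only through the framework. All names
# are hypothetical illustrations of the plug-in idea.

from abc import ABC, abstractmethod


class VideoAnalyzer(ABC):
    """Interface every analyzer plug-in is assumed to implement."""

    @abstractmethod
    def supported_type(self) -> str:
        """Video type this analyzer handles, e.g. 'tennis'."""

    @abstractmethod
    def analyze(self, video_path: str) -> dict:
        """Return extracted metadata for the given video."""


class AnalyzerManager:
    """Framework-side registry: plug-ins can be added or removed freely."""

    def __init__(self):
        self._plugins = {}

    def register(self, analyzer: VideoAnalyzer) -> None:
        self._plugins[analyzer.supported_type()] = analyzer

    def unregister(self, video_type: str) -> None:
        self._plugins.pop(video_type, None)

    def analyze(self, video_type: str, video_path: str) -> dict:
        if video_type not in self._plugins:
            raise KeyError(f"no analyzer registered for {video_type!r}")
        return self._plugins[video_type].analyze(video_path)


# Toy plug-in standing in for a tennis analyzer; a real one would run
# the vision/audio pipeline instead of returning canned metadata.
class TennisAnalyzer(VideoAnalyzer):
    def supported_type(self) -> str:
        return "tennis"

    def analyze(self, video_path: str) -> dict:
        return {"video": video_path, "events": ["serve", "rally"]}


manager = AnalyzerManager()
manager.register(TennisAnalyzer())
print(manager.analyze("tennis", "match.mp4"))
```

Because applications depend only on the manager and the analyzer interface, an analyzer for a new video type can be dropped in (or removed) without touching application code, which is the decoupling the framework is designed around.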