SOCCER VIDEO EVENT DETECTION WITH VISUAL KEYWORDS

Yu-Lin Kang #*, Joo-Hwee Lim #, Qi Tian #, Mohan S. Kankanhalli *

# Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613
{yulin,joohwee,tian}@i2r.a-star.edu.sg

* School of Computing, National University of Singapore, Kent Ridge, Singapore 119260
mohan@comp.nus.edu.sg

Abstract

In this paper, we propose a new two-level framework to analyze the high-level structure of video and to detect useful events automatically based on visual keywords. The first level extracts low-level features such as motion, color and texture to detect video segment boundaries and to label segments as visual keywords. At the second level, we apply an event detection grammar to the visual keyword sequence to detect video segments that match a pre-defined event model. The exact position at which the event occurs can also be spotted. We have applied the proposed approach to the detection of goal and corner-kick events in portions of 4 FIFA World Cup 2002 soccer videos (1666 segments) with more than 80% accuracy.

1. Introduction

The amount of accessible video information has been increasing rapidly. Users can easily get lost in this myriad of voluminous video data, and locating a relevant video segment by linear browsing is very time-consuming. Automatic detection of semantic events in video is therefore very useful. In particular, an increasing number of event detection algorithms are being developed for sports video. In the case of the soccer game, which attracts a global viewership, research effort has focused on extracting high-level structures [6-8] and detecting key highlights to facilitate annotation and browsing [1,2,5]. All soccer event detection systems known to us share two common features.
First, the modeling of high-level events such as play-break, corner kicks and goals is anchored directly on low-level features such as motion and color [1,5,6], leaving a large semantic gap between computable features and content meaning as understood by humans. Second, some of these systems tend to engineer the analysis process with very specific domain knowledge to achieve more accurate object and/or event recognition. Such a highly domain-dependent approach makes the production process and the resulting system ad hoc and not reusable even for a similar domain (e.g. another type of sports video).

In this paper, we propose a novel two-level event detection framework and demonstrate it on soccer videos. Our goal is to make our system adaptable to different events in different domains. To achieve this goal, we introduce a mid-level representation called visual keywords that can be learned and detected from video segments. Based on the visual keywords, a computational system that realizes the framework comprises two levels of processing (Figure 1):

1. The first level focuses on video segmentation and visual keyword classification. At this level, the video stream is partitioned into video segments, and each segment is labeled with one or more keywords with certainty values. In the simpler case considered in this paper, only one visual keyword is assigned to each segment. In other words, the first level parses the video stream and outputs a sequence of visual keywords. In a separate paper [4], a system for soccer video segmentation and classification using color and motion features is presented; as a whole, that system achieves around 80% accuracy in visual keyword classification. We therefore concentrate only on the second level in this paper.

2. The second level deals with event detection. In general, the probabilistic mapping between the visual keyword sequence and the events can be modeled either statistically (e.g.
HMM) or syntactically (e.g. grammar). In this paper, we develop an event detection grammar to parse and detect events from the visual keyword sequence.

This two-level design makes our system flexible: it can be applied to different events by adapting the event detection grammar to the new event model, and it can be applied to different domains by adapting the vocabulary of visual keywords and its classifiers.

0-7803-8185-8/03/$17.00 © 2003 IEEE  3C2.4  ICICS-PCM 2003, 15-18 December 2003, Singapore
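To make the two-level idea concrete, the second level can be approximated as pattern matching over the symbol sequence emitted by the first level. The sketch below is illustrative only, not the paper's actual grammar or vocabulary: the keyword names, the "goal" pattern, and the function name are all assumptions, using a regular expression as a minimal stand-in for an event grammar.

```python
import re

# Hypothetical visual-keyword vocabulary (for illustration only); the real
# vocabulary would come from the first-level segmentation/classification [4].
VOCAB = {"far-view": "F", "goal-area": "G", "close-up": "C", "replay": "R"}

# Toy "goal" event grammar: a goal-area segment followed by one or more
# close-ups and at least one replay segment.
GOAL_PATTERN = re.compile(r"GC+R+")

def detect_events(keywords):
    """Map each segment's keyword to a symbol, then return the (start, end)
    segment indices of every subsequence matching the event grammar."""
    symbols = "".join(VOCAB[k] for k in keywords)
    return [(m.start(), m.end() - 1) for m in GOAL_PATTERN.finditer(symbols)]

# One first-level output sequence: the event spans segments 1..4.
seq = ["far-view", "goal-area", "close-up", "close-up", "replay", "far-view"]
print(detect_events(seq))  # -> [(1, 4)]
```

Adapting the system to a new event then amounts to changing the pattern, and adapting it to a new domain amounts to changing the vocabulary and its classifiers, which is the flexibility argued for above.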