SOCCER VIDEO EVENT DETECTION WITH VISUAL KEYWORDS
Yu-Lin Kang#*, Joo-Hwee Lim#, Qi Tian#, Mohan S. Kankanhalli*
# Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613
  {yulin,joohwee,tian}@i2r.a-star.edu.sg
* School of Computing, National University of Singapore, Kent Ridge, Singapore 119260
  mohan@comp.nus.edu.sg
Abstract
In this paper, we propose a new two-level framework for analyzing the
high-level structure of video and detecting useful events automatically
based on visual keywords. The first level extracts low-level features
such as motion, color, and texture to detect video segment boundaries
and to label segments with visual keywords. At the second level, we
apply an event detection grammar to the visual keyword sequence to
detect video segments that match a pre-defined event model; the exact
position at which the event occurs can also be spotted. We have applied
the proposed approach to the detection of goal and corner-kick events in
portions of 4 FIFA World Cup 2002 soccer videos (1,666 segments) with
more than 80% accuracy.
1. Introduction
The amount of accessible video information has been increasing rapidly.
Because video data are voluminous, users quickly get lost in them, and
locating a relevant video segment by linear search is very
time-consuming. Automatic detection of semantic events in video is
therefore very useful, and an increasing number of event detection
algorithms are being developed for sports video in particular. In the
case of soccer, which attracts a global viewership, research effort has
focused on extracting high-level structures [6-8] and detecting key
highlights to facilitate annotation and browsing [1,2,5].
All soccer event detection systems known to us share two common
characteristics. First, the modeling of high-level events such as
play-breaks, corner kicks, and goals is anchored directly on low-level
features such as motion and color [1,5,6], leaving a large semantic gap
between computable features and content meaning as understood by humans.
Second, some of these systems engineer the analysis process with very
specific domain knowledge to achieve more accurate object and/or event
recognition. This highly domain-dependent approach makes the production
process and the resulting system ad hoc and not reusable even in a
similar domain (e.g. another type of sports video).
In this paper, we propose a novel two-level event detection
framework and demonstrate it on soccer videos. Our goal
is to make our system adaptable to different events in
different domains. To achieve our goal, we introduce a
mid-level representation called visual keywords that can be
learned and detected from video segments. Based on the
visual keywords, a computational system that realizes the
framework comprises two levels of processing (Figure 1):
1. The first level focuses on video segmentation and visual keyword
classification. At this level, the video stream is partitioned into
segments, and each segment is labeled with one or more visual keywords
with certainty values. In the simpler case considered in this paper,
only one visual keyword is assigned to each segment. In other words, the
first level parses the video stream and outputs a sequence of visual
keywords.
In a separate paper [4], a system for soccer video segmentation and
classification using color and motion features is presented. As a whole,
that system achieves around 80% accuracy in visual keyword
classification. Therefore, in this paper we concentrate only on the
second level.
2. The second level deals with event detection. In general, the
probabilistic mapping between the visual keyword sequence and the events
can be modeled either statistically (e.g. with an HMM) or syntactically
(e.g. with a grammar). In this paper, we develop an event detection
grammar to parse the visual keyword sequence and detect events.
This two-level design makes our system flexible: it can be applied to
different events by adapting the event detection grammar to the new
event model, and to different domains by adapting the vocabulary of
visual keywords and their classifiers.
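To make the second-level idea concrete, the following is a minimal
sketch of matching an event model against a visual keyword sequence. The
keyword vocabulary and the goal-event pattern below are illustrative
assumptions for exposition only, not the paper's actual grammar; the
sketch expresses the event model as a regular expression, one simple
instance of a syntactic model.

```python
# Hypothetical sketch: detect an event by matching a hand-written
# pattern against the visual keyword sequence from the first level.
# Vocabulary and pattern are illustrative assumptions, not the paper's
# actual grammar.
import re

# Map each visual keyword to a single character so the event model can
# be expressed as a regular expression over the encoded sequence.
KEYWORD_CODES = {
    "far-view": "F",
    "goal-area": "G",
    "close-up": "C",
    "audience": "A",
    "replay": "R",
}

# Illustrative goal-event model: play in the goal area, followed by one
# or more close-up/audience segments, then at least one replay.
GOAL_PATTERN = re.compile(r"G[CA]+R+")

def detect_events(keywords):
    """Return (start, end) segment indices of subsequences matching the model."""
    encoded = "".join(KEYWORD_CODES[k] for k in keywords)
    return [(m.start(), m.end() - 1) for m in GOAL_PATTERN.finditer(encoded)]

sequence = ["far-view", "goal-area", "close-up", "audience",
            "replay", "replay", "far-view"]
print(detect_events(sequence))  # [(1, 5)]
```

Because the match carries segment indices, this style of detection also
pinpoints where in the sequence the event occurs, mirroring the
event-spotting property described above.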
0-7803-8185-8/03/$17.00 © 2003 IEEE
3C2.4, ICICS-PCM 2003, 15-18 December 2003, Singapore