We propose two approaches for semantic indexing of audio–visual documents, based on bottom-up and top-down strategies. We base the first approach on a finite-state machine using low-level motion indices extracted from an MPEG compressed bitstream. The second approach innovatively performs semantic indexing through Hidden Markov Models.
If we want widespread use of and access to richer and novel information sources, we’ll need effective navigation through multimedia documents. In this context, the design of efficient indexing techniques that facilitate the retrieval of relevant information is an important issue. Developing automatic procedures to semantically index audio–visual material represents an important challenge. Ideally, we could design such methods to create suitable indices of the audio–visual material, which characterize the temporal structure of a multimedia document from a semantic point of view [1].
Traditionally, the most common approach to creating an index of an audio–visual document is based on the automatic detection of changes between camera records and the types of editing effects involved. This kind of approach generally demonstrates satisfactory performance and leads to a good low-level temporal characterization of the visual content. However, semantic characterization remains poor because the description is fragmented, given the high number of shot transitions occurring in typical audio–visual programs.
Alternatively, recent research efforts base the analysis of audio–visual documents on joint audio and video processing to provide a higher-level organization of information [2,3]. Saraceno and Leonardi [3] considered these two information sources for identifying simple scenes that compose an audio–visual program.
Here we propose and compare the performance of two different classes of approaches for semantic indexing of audio–visual documents. In the first class, we tackle the problem in a top-down fashion to identify a specific event in a certain program. In the second class, we first identify structuring elements from the data, then group them to form new patterns that we can further combine into a hierarchy. More precisely, we apply the top-down approach to identify relevant situations in soccer video sequences. In the complementary bottom-up approach, we combine audio and visual descriptors associated with individual shots and their associated audio segments to extract higher-level semantic entities.
Many researchers have studied automatic detection of semantic events in sports games. Generally, the goal is to identify certain spatio-temporal segments corresponding to semantically significant events. Tovinkere et al. [4], for example, presented a method that tries to detect the complete set of semantic events that might happen in a soccer game. This method uses the players’ and ball’s position information during the game as input. As a result, the approach requires a complex and accurate tracking system to obtain this information.
In our approach, we consider only the motion information associated with an MPEG-2 bitstream. We address the problem by trying to identify a correlation between semantic events and the low-level motion indices associated with a video sequence.
In particular, we considered three low-level indices that represent the following characteristics: lack of motion, camera operations (represented by pan and zoom parameters), and presence of shot cuts. We then studied the correlation between these indices and the semantic events, demonstrating their usefulness [5,6]. To exploit this correlation, we propose an algorithm based on finite-state machines that can detect the presence of goals and other relevant events in soccer games.
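
To make the idea of a finite-state machine driven by the three low-level indices concrete, here is a minimal Python sketch. It is not the machine described in this article: the states, the pattern (fast camera operation, then lack of motion, then a shot cut), the `timeout` parameter, and all field and function names are assumptions chosen only for illustration. It also assumes the three indices have already been thresholded into per-frame Boolean flags.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Iterable, List


@dataclass(frozen=True)
class MotionIndices:
    """Per-frame low-level indices (hypothetical field names)."""
    lack_of_motion: bool = False   # motion field is close to zero
    fast_camera_op: bool = False   # pan or zoom above some threshold
    shot_cut: bool = False         # a shot cut was detected at this frame


class State(Enum):
    IDLE = auto()        # nothing of interest seen yet
    CAMERA_OP = auto()   # fast pan/zoom observed, waiting for stillness
    STILL = auto()       # stillness observed, waiting for a shot cut


def detect_candidates(frames: Iterable[MotionIndices],
                      timeout: int = 50) -> List[int]:
    """Return the frame numbers at which the illustrative pattern completes.

    Assumed pattern: fast camera operation -> lack of motion -> shot cut,
    with at most `timeout` consecutive waiting frames in a non-idle state.
    """
    state, waited = State.IDLE, 0
    hits: List[int] = []
    for i, f in enumerate(frames):
        if state is State.IDLE:
            if f.fast_camera_op:
                state, waited = State.CAMERA_OP, 0
        elif state is State.CAMERA_OP:
            if f.lack_of_motion:
                state, waited = State.STILL, 0
            elif f.fast_camera_op:
                waited = 0            # camera operation still in progress
            else:
                waited += 1
        else:  # State.STILL
            if f.shot_cut:
                hits.append(i)        # pattern complete: candidate event
                state, waited = State.IDLE, 0
            else:
                waited += 1
        if waited > timeout:          # pattern went stale; start over
            state, waited = State.IDLE, 0
    return hits
```

A detector of this shape only flags candidate positions; how the indices are thresholded and which index patterns actually correspond to goals would have to come from the correlation study cited above.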
As we mentioned earlier, in the complementary bottom-up approach, we combine audio and visual descriptors to extract higher-level semantic entities such as scenes or even individual program items. In particular, we perform the indexing through Hidden Markov Models (HMMs) used in an innovative framework. Our approach considers the input signal as a nonstationary stochastic process, modeled by an HMM in which each state stands for a different signal class [7].
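
When each HMM state stands for a signal class, recovering the most likely state sequence amounts to labeling the signal segment by segment. The Python sketch below illustrates this with standard Viterbi decoding over discrete observation symbols. The choice of Viterbi decoding, the discrete-symbol emission model, and the function and parameter names are all assumptions made for illustration; the text above does not specify how the authors' framework performs inference or how its models are trained.

```python
import math
from typing import Dict, Hashable, List, Sequence


def viterbi(obs: Sequence[Hashable],
            states: Sequence[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[Hashable, float]]) -> List[str]:
    """Most likely hidden-state (signal-class) sequence for `obs`.

    All probabilities are plain floats; the recursion runs in the log
    domain to avoid numerical underflow on long sequences.
    """
    if not obs:
        return []

    def log(p: float) -> float:
        return math.log(p) if p > 0 else float("-inf")

    # Initialisation: score of starting in each state and emitting obs[0].
    score = {s: log(start_p[s]) + log(emit_p[s][obs[0]]) for s in states}
    back: List[Dict[str, str]] = []

    # Recursion: best predecessor for every state at every time step.
    for o in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + log(trans_p[r][s]))
            score[s] = prev[best] + log(trans_p[best][s]) + log(emit_p[s][o])
            pointers[s] = best
        back.append(pointers)

    # Backtracking from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

With "sticky" transition probabilities (a high probability of staying in the same state), an isolated outlier symbol does not change the decoded class, while a sustained change in the observations does. This is one reason an HMM can yield a less fragmented description than labeling each observation independently.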
1070-986X/02/$17.00 © 2002 IEEE
Semantic Indexing of Multimedia Documents
Riccardo Leonardi and Pierangelo Migliorati
University of Brescia
Content-Based Multimedia Indexing and Retrieval