We propose two approaches for semantic indexing of audio–visual documents, based on bottom-up and top-down strategies. We base the first approach on a finite-state machine using low-level motion indices extracted from an MPEG compressed bitstream. The second approach innovatively performs semantic indexing through Hidden Markov Models.

If we want widespread use of and access to richer and novel information sources, we'll need effective navigation through multimedia documents. In this context, the design of efficient indexing techniques that facilitate the retrieval of relevant information is an important issue. Developing automatic procedures to semantically index audio–video material represents an important challenge. Ideally, we could design such methods to create suitable indices of the audio–visual material, which characterize the temporal structure of a multimedia document from a semantic point of view. [1]

Traditionally, the most common approach to creating an index of an audio–visual document is based on the automatic detection of changes between camera records (shot transitions) and of the types of editing effects involved. This kind of approach generally demonstrates satisfactory performance and leads to a good low-level temporal characterization of the visual content. However, the semantic characterization remains poor because the description is fragmented, given the high number of shot transitions occurring in typical audio–visual programs.

Alternatively, recent research efforts base the analysis of audio–visual documents on joint audio and video processing to provide a higher level organization of information. [2, 3] Saraceno and Leonardi [3] considered these two information sources for identifying simple scenes that compose an audio–visual program.

Here we propose and compare the performance of two different classes of approaches for semantic indexing of audio–visual documents.
In the first one, we tackle the problem in a top-down fashion to identify a specific event in a certain program. In the second class, we first identify structuring elements from the data, then group them to form new patterns that we can further combine into a hierarchy. More precisely, we apply the top-down approach to identify relevant situations in soccer video sequences. In the complementary bottom-up approach, we combine audio and visual descriptors associated with individual shots and their audio segments to extract higher level semantic entities.

Many researchers have studied automatic detection of semantic events in sports games. Generally, the goal is to identify certain spatio-temporal segments corresponding to semantically significant events. Tovinkere et al., [4] for example, presented a method that tries to detect the complete set of semantic events that might happen in a soccer game. This method uses the players' and the ball's position information during the game as input. As a result, the approach requires a complex and accurate tracking system to obtain this information.

In our approach, we consider only the motion information associated with an MPEG-2 bitstream. We addressed the problem by trying to identify a correlation between semantic events and the low-level motion indices associated with a video sequence. In particular, we considered three low-level indices that represent the following characteristics: lack of motion, camera operations (represented by pan and zoom parameters), and presence of shot cuts. We then studied the correlation between these indices and the semantic events, demonstrating their usefulness. [5, 6] To exploit this correlation, we propose an algorithm based on finite-state machines that can detect the presence of goals and other relevant events in soccer games.
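To make the idea of a finite-state machine driven by the three low-level indices more concrete, the following is a minimal sketch in Python. It is not the algorithm of this article: the state names, the transition rules (fast camera operation, then lack of motion, then a shot cut), the thresholds, and the `MotionIndices` record are all assumptions introduced purely for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Iterable, List


@dataclass(frozen=True)
class MotionIndices:
    """Hypothetical per-frame indices; field names are illustrative only."""
    lack_of_motion: bool  # True when the motion field is (almost) null
    pan: float            # estimated camera pan parameter
    zoom: float           # estimated camera zoom parameter
    shot_cut: bool        # True when a shot cut is detected at this frame


class State(Enum):
    IDLE = auto()         # nothing of interest observed yet
    FAST_CAMERA = auto()  # strong pan or zoom was observed
    STILL = auto()        # lack of motion followed the camera operation


def detect_candidate_events(
    frames: Iterable[MotionIndices],
    pan_thr: float = 10.0,   # assumed threshold, not taken from the article
    zoom_thr: float = 0.1,   # assumed threshold, not taken from the article
    timeout: int = 50,       # assumed maximum number of frames spent waiting
) -> List[int]:
    """Return frame positions where the assumed pattern completes."""
    state = State.IDLE
    age = 0
    events: List[int] = []
    for i, f in enumerate(frames):
        fast = abs(f.pan) > pan_thr or abs(f.zoom) > zoom_thr
        if state is State.IDLE:
            if fast:
                state, age = State.FAST_CAMERA, 0
        elif state is State.FAST_CAMERA:
            if f.lack_of_motion:
                state, age = State.STILL, 0
            elif fast:
                age = 0  # camera operation still ongoing
            else:
                age += 1
                if age > timeout:
                    state = State.IDLE
        else:  # State.STILL
            age += 1
            if f.shot_cut:
                events.append(i)  # pattern complete: flag a candidate event
                state = State.IDLE
            elif age > timeout:
                state = State.IDLE
    return events
```

With this sketch, a sequence consisting of a quiet frame, a strong pan, a motionless frame, and a shot cut is flagged at the position of the cut, while an isolated shot cut is ignored. The timeout is one possible design choice for discarding partial patterns that never complete.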
As we mentioned earlier, in the complementary bottom-up approach, we combine audio and visual descriptors to extract higher level semantic entities such as scenes or even individual program items. In particular, we perform the indexing through Hidden Markov Models (HMM) used in an innovative framework. Our approach considers the input signal as a nonstationary stochastic process, modeled by an HMM in which each state stands for a different signal class. [7]

Semantic Indexing of Multimedia Documents
Riccardo Leonardi and Pierangelo Migliorati, University of Brescia
Content-Based Multimedia Indexing and Retrieval
1070-986X/02/$17.00 © 2002 IEEE
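The HMM view described above, in which each hidden state stands for a signal class, can be illustrated with a small decoding sketch. The article does not say here how the state sequence is recovered, so the use of Viterbi decoding, the discrete observation symbols, and every probability value below are assumptions made only for this example.

```python
import math
from typing import List, Sequence


def _log(p: float) -> float:
    """Logarithm that maps zero probability to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(
    obs: Sequence[int],
    start_p: Sequence[float],
    trans_p: Sequence[Sequence[float]],
    emit_p: Sequence[Sequence[float]],
) -> List[int]:
    """Most likely state (signal class) sequence for discrete observations."""
    if not obs:
        return []
    n = len(start_p)
    # Log-score of the best path ending in each state after the first symbol.
    score = [_log(start_p[s]) + _log(emit_p[s][obs[0]]) for s in range(n)]
    back: List[List[int]] = []
    for o in obs[1:]:
        prev = score
        pointers: List[int] = []
        score = []
        for s in range(n):
            # Best predecessor state for landing in state s at this step.
            best = max(range(n), key=lambda r: prev[r] + _log(trans_p[r][s]))
            pointers.append(best)
            score.append(prev[best] + _log(trans_p[best][s]) + _log(emit_p[s][o]))
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical two-class model: class 0 and class 1 each favour one symbol.
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]  # classes tend to persist over time
emit = [[0.8, 0.2], [0.2, 0.8]]
labels = viterbi([0, 0, 1, 0, 0, 1, 1, 1], start, trans, emit)
```

In this toy model the isolated symbol `1` in the third position does not trigger a class change, because the high self-transition probability favours staying in class 0; only the sustained run of `1` symbols at the end moves the decoded path to class 1. Contiguous runs of the same decoded class can then be read as segments of the input signal.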