Hierarchical Decision Making Scheme for Sports Video Categorisation with Temporal Post-Processing

Edward Jaser, Josef Kittler and William Christmas
Centre for Vision, Speech and Signal Processing
University of Surrey, Guildford GU2 7XH, UK
{E.Jaser, J.Kittler, W.Christmas}@eim.surrey.ac.uk

Abstract

The problem of automatic sports video classification is considered. We develop a multistage decision making system that is founded on the concept of cues, i.e. pieces of visual evidence, characteristic of certain categories of sports, that are extracted from key frames. The main decision making mechanism is a decision tree which generates hypotheses concerning the semantics of the sports video content. The final stage of the decision making process is a Hidden Markov Model (HMM) system which bridges the gap between the semantic content categorisation defined by the user and the actual visual content categories. The latter are often ambiguous, as the same visual content may be attributed to different sport categories, depending on the context. We demonstrate experimentally that the contextual post-processing of the decision tree outputs by HMMs significantly improves the performance of the sports video classification system.

1. Introduction

The generation of digital multimedia content continues to witness phenomenal growth. In the particular domain of sport, many events take place every day, and an overwhelmingly vast amount of sports video material is being recorded and stored. Ideally, and to ensure usability, all this sports material should be annotated, and the metadata generated for it should be stored in a database along with the video data. This would allow the retrieval of any important event at a later date. Such a system has many uses, such as in the production of television sport programmes and documentaries.

Due to the large amount of material being generated, manual annotation is both impractical and very expensive.
In this paper we consider the problem of automatic sports video categorisation. This problem arises during multidisciplinary events such as the Olympic Games, where huge volumes of video material are recorded, with the content randomly switching from one discipline to another. A coarse automatic annotation in terms of sport identity would aid the production of event summaries for newscasts and other applications.

Much research in the field of multimedia analysis and retrieval targets the domain of sports video. The reason is that most sports videos have a well-defined content structure and official rules and procedures, compared to videos from other domains. A sport can be defined as a set of one or more fundamental semantic events. The event life cycle is characterised by a starting stage, an action stage and a terminal stage. The action stage can be skipped depending on the status of the starting stage. Play is usually suspended at the end of each event. The repetition of these events in some order defines higher-level events and forms the structure of the sport. Moreover, most sporting events take place in one location. That means only a limited number of cameras, most at fixed positions, are needed to cover the play area and capture the event.

Figure 1: Sport views (Swimming and Hockey; view types: global view, zoom in, close-up, crowd)

The camera that best captures the event taking place at a certain time is selected for broadcasting. Therefore, a set of characteristic views recorded by the cameras can be defined and associated with the events. Figure 1 gives an example of some characteristic views that exist in two sport disciplines, swimming and hockey. Between the end of one event and the start of the following one, during which play is suspended, other events that can be either related to the sport (replay,

0-7695-2158-4/04 $20.00 (C) 2004 IEEE