Copyright (c) 2013 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing pubs-permissions@ieee.org. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

JOURNAL OF LATEX CLASS FILES, VOL. , NO. , JANUARY 2013

Event Detection and Summarization in Soccer Videos Using Bayesian Network and Copula

Mostafa Tavassolipour, Mahmood Karimian, and Shohreh Kasaei, Senior Member, IEEE

Abstract—Semantic video analysis and automatic concept extraction play an important role in several applications, including content-based search engines, video indexing, and video summarization. As the Bayesian network is a powerful tool for learning complex patterns, a novel Bayesian network-based method is proposed for automatic event detection and summarization in soccer videos. The proposed method includes efficient algorithms for shot boundary detection, shot view classification, mid-level visual feature extraction, and construction of the related Bayesian network. The method consists of three main stages. In the first stage, the shot boundaries are detected and, using a hidden Markov model, the video is segmented into large and meaningful semantic units, called "play-break" sequences. In the next stage, several features are extracted from each of these units. Finally, in the last stage, a Bayesian network is used to infer high-level semantic features (events and concepts). The core of the method is the construction of the Bayesian network, whose structure is estimated using the Chow-Liu tree. The joint distributions of the random variables of the network are modeled by applying the Farlie-Gumbel-Morgenstern family of copulas. The performance of the proposed method is evaluated on a dataset of about 9 hours of soccer videos.
The method is capable of detecting seven different events in soccer videos; namely, goal, card, goal attempt, corner, foul, offside, and non-highlights. Experimental results show the effectiveness and robustness of the proposed method in detecting these events.

Index Terms—Bayesian Network, Copula Distribution, Semantic Video Analysis, Video Summarization.

Manuscript received March 14, 2012; revised Dec. 19, 2012.
M. Tavassolipour is with the Department of Computer Engineering, Sharif University of Technology, Tehran, Iran (e-mail: tavassolipour@ce.sharif.edu).
M. Karimian is with the Department of Computer Engineering, Sharif University of Technology, Tehran, Iran (e-mail: karimian@ce.sharif.edu).
S. Kasaei is with the Department of Computer Engineering, Sharif University of Technology, Tehran, Iran (e-mail: skasaei@sharif.edu; see http://sharif.edu/~skasaei).

I. INTRODUCTION

The rate of audio-visual data produced in the form of images and video has increased rapidly in recent years. With the growth of computational power and electronic storage capacity, the need for large digital image/video libraries has also increased. For example, on the Internet there exists a large amount of anonymous image and video data that forms the basis of many entertainment, educational, and commercial applications. This makes searching among image/video data a challenging issue for users. Thus, an appropriate digital image/video library must provide easy access to information and facilitate the retrieval of content. When the total length of videos reaches thousands of hours, users need a system for summarizing and abstracting them in order to have an efficient and effective search.
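As background for the copula model named in the abstract: the Farlie-Gumbel-Morgenstern (FGM) family has a simple closed form in the bivariate case, C(u, v) = uv[1 + theta(1 - u)(1 - v)] with density c(u, v) = 1 + theta(1 - 2u)(1 - 2v), valid for |theta| <= 1. The sketch below is the textbook bivariate definition only, not the paper's full multivariate construction over the network's variables:

```python
def fgm_copula_cdf(u, v, theta):
    """Bivariate Farlie-Gumbel-Morgenstern copula
    C(u, v) = u*v*(1 + theta*(1-u)*(1-v)), valid for |theta| <= 1.
    u, v are marginal CDF values in [0, 1]."""
    return u * v * (1.0 + theta * (1.0 - u) * (1.0 - v))

def fgm_copula_pdf(u, v, theta):
    """FGM copula density c(u, v) = 1 + theta*(1-2u)*(1-2v)."""
    return 1.0 + theta * (1.0 - 2.0 * u) * (1.0 - 2.0 * v)
```

Setting theta = 0 recovers the independence copula C(u, v) = uv, which is why the FGM family is a convenient way to model weak dependence between a pair of features.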
Although text-based search algorithms assist users in finding a specific image or a segment of a long video, in most cases the system outputs many irrelevant videos to ensure the retrieval of the target video. Intelligent indexing and summarization systems are thus an essential need in the process of video retrieval.

Among all video types, sports videos attract many viewers and usually last for long hours. Sports videos, in general, are composed of a number of interesting events that capture users' attention. For most people, a summarized version of a sports video is more attractive than the full-length version. Although a generic sports video summarization system is sufficient and useful, a domain-specific summarization system, such as one for soccer videos, may offer more facilities to users. Most sports broadcasters use editing effects, such as slow-motion replay scenes and superimposed text captions, to distinguish the key events. Therefore, high-level semantics can be detected using these editing effects together with automatically extracted audio-visual features. Processing of sports videos (e.g., detection of important events and creation of summaries) makes it possible to deliver sports videos over narrow-band networks (such as the Internet and wireless networks), since the valuable semantics generally occupy only a small portion of the whole content [1].

One of the challenging problems in a video event detection method is event boundary detection. Some methods, such as [2], propose a frame-based algorithm for event detection, while other methods, such as [3] and [1], use temporal video segments to extract more meaningful semantic units for event detection. In our method, like [3], we have used the "play-break" sequence as the semantic unit for event detection. Each play-break consists of two sections, called "play" and "break".
In soccer videos, the game is in the "play" mode while the game is going on; the "break" mode is its complement, that is, whenever the game is halted because of the occurrence of an event.

A novel method for shot boundary detection and shot view classification is proposed. For shot boundary detection, both compressed-domain and spatial-domain features are used. For shot view classification, a new method based on object detection is proposed. This paper also presents a novel method for segmenting the video into its "play-break" sequences. From each play-break sequence, some features are extracted and classified using a Bayesian network. The Bayesian network consists of a single hidden variable and several observable random variables. The network structure is determined by the Chow-Liu algorithm. We have proposed a novel method for calculating
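The Chow-Liu structure estimation mentioned above can be sketched as follows: compute the pairwise mutual information between every two variables, then keep the maximum-weight spanning tree over those weights. The snippet below is an illustration under the assumption of binary mid-level features, not the paper's implementation; variable names and the data layout are hypothetical.

```python
import itertools
import numpy as np

def chow_liu_edges(X):
    """Chow-Liu tree over the columns of a binary data matrix X
    (n_samples x n_vars): pairwise mutual information as edge weights,
    then a maximum-weight spanning tree via Kruskal with union-find."""
    n, d = X.shape

    def mutual_info(a, b):
        # Empirical MI in nats over the four joint outcomes of two binary vars.
        mi = 0.0
        for va, vb in itertools.product((0, 1), repeat=2):
            p_ab = np.mean((a == va) & (b == vb))
            if p_ab > 0:
                p_a, p_b = np.mean(a == va), np.mean(b == vb)
                mi += p_ab * np.log(p_ab / (p_a * p_b))
        return mi

    # All pairwise weights, heaviest first.
    weights = sorted(
        ((mutual_info(X[:, i], X[:, j]), i, j)
         for i in range(d) for j in range(i + 1, d)),
        reverse=True)

    parent = list(range(d))
    def find(x):  # union-find root with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = []
    for w, i, j in weights:  # Kruskal: greedily add edges that do not close a cycle
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            edges.append((i, j))
    return edges  # the d-1 edges of the Chow-Liu tree
```

Among all trees, this one maximizes the total mutual information, which is equivalent to maximizing the likelihood of a tree-factored distribution; directing the edges away from a chosen root yields the Bayesian network skeleton.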