Hierarchical Segmentation of Presentation Videos through Visual and Text Analysis

Honglin Li and Aijuan Dong
Department of Computer Science, North Dakota State University
Fargo, ND 58105
{honglin.li, aijuan.dong}@ndsu.edu

Abstract - Presentation videos play an important role in information sharing and exchange. To use these video assets effectively, one important step is to segment a long video stream into smaller, semantic units. In this paper, we investigate hierarchical segmentation of presentation videos by combining visual and text analysis. Slide-level segmentation employs visual information and computes a sequence of slide-level video segments such that the projected slide image does not change within a segment. Topic-level segmentation makes use of extracted slide text and generates a sequence of topic-level video segments such that the topic does not change within a segment. The proposed segmentation procedure has been tested on various presentation videos, and experimental results are presented and discussed.

Keywords - hierarchical video segmentation, presentation video, visual information, text analysis, topic words.

1. INTRODUCTION

With recent advances in multimedia processing and automatic presentation recording, a large number of presentation videos are produced from conferences, lectures, meetings, and corporate trainings. These presentation videos cover a wide spectrum of topics and play an important role in information sharing and exchange. However, because video is unstructured and linear, people often have difficulty locating a specific piece of information in a presentation video. To ensure effective exploitation of these video assets, efficient and flexible access mechanisms must be provided. Prior research has found that multimedia users strongly prefer hierarchical video access.
With hierarchical presentation, video content is organized at different levels of granularity, which allows users to flexibly access the video segments of particular interest to them. In a search scenario, instead of returning a whole video that contains a lot of irrelevant information, the most relevant video segment can be returned, thus increasing retrieval relevance. To provide hierarchical video access, the first important step is to hierarchically segment a long video stream into smaller, semantic units.

A variety of techniques have been proposed to segment presentation videos. Earlier work from the Cornell Lecture Browser [1] uses a feature-based algorithm to segment a slide video stream. First, frames are clipped, filtered, and adaptively thresholded to produce binary images. Then, feature differences between binary images are calculated and used to segment the slide video stream. Later, Yamamoto et al. [2] propose topic segmentation of lecture videos by associating lecture speech with the lecture textbook. The association is performed by computing the similarity between topic vectors obtained from the lecture textbook and a sequence of lecture vectors obtained from lecture speech through spontaneous speech recognition. In another paper, a content density function is proposed to segment instructional videos [3]. The content density function draws on the observation that topic boundaries coincide with the ebb and flow of the "density" of content shown in videos. Recently, Lin et al. [4] investigate a linguistics-based approach for lecture video segmentation. Multiple linguistics-based segmentation features from lecture speech, such as noun phrases and cue phrases, are extracted and explored.

Despite these successes, most of the approaches described above focus on linearly segmenting video streams into smaller units. In our study, we noticed that a presentation usually consists of many topics, and each topic covers several slides (Figure 1).
This structure enables hierarchical segmentation, indexing, and access.

This paper focuses on hierarchical segmentation of presentation videos. Specifically, we investigate segmentation at two levels: topic level and slide level. As in most video segmentation tasks, visual information alone cannot reliably detect topic changes, so topic-level segmentation is usually based on analysis of related text. In this paper, we study topic-level segmentation of presentation videos through analysis of extracted slide text. Slide-level segmentation employs visual information. To map the segmentation results from slide text analysis back to the video and achieve hierarchical segmentation, extracted key frames are matched against converted slide images through image edge analysis.

The rest of the paper is organized as follows. We give an overview of the approach in Section 2, then discuss slide-level segmentation and topic-level segmentation in detail in Sections 3 and 4, respectively. Experimental results are given

[Figure 1. Hierarchical view of presentations: a presentation consists of Topic 1 … Topic n, and each topic consists of Slide 1 … Slide m.]
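As an illustration only, the two-level hierarchy of Figure 1 can be sketched as a small data structure in which each topic-level segment is a run of consecutive slide-level segments. This is a minimal sketch under stated assumptions, not the implementation used in this paper: the names `SlideSegment`, `TopicSegment`, and `group_into_topics` are hypothetical, and the topic boundaries are assumed to be already known (for example, from slide text analysis) as positions in the list of slide segments.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class SlideSegment:
    """Slide-level segment: the projected slide does not change within it."""
    slide_index: int   # position of the slide in the presentation (0-based)
    start_sec: float   # segment start time in the video, in seconds
    end_sec: float     # segment end time in the video, in seconds


@dataclass
class TopicSegment:
    """Topic-level segment: consecutive slide segments sharing one topic."""
    slides: List[SlideSegment] = field(default_factory=list)

    @property
    def start_sec(self) -> float:
        # A topic starts where its first slide segment starts.
        return self.slides[0].start_sec

    @property
    def end_sec(self) -> float:
        # A topic ends where its last slide segment ends.
        return self.slides[-1].end_sec


def group_into_topics(slides: List[SlideSegment],
                      topic_starts: Sequence[int]) -> List[TopicSegment]:
    """Group slide segments into topic segments.

    topic_starts holds the list positions at which a new topic begins
    (hypothetical input). It must start at 0, be strictly increasing,
    and stay within the list of slide segments.
    """
    if not slides:
        return []
    starts = list(topic_starts)
    if not starts or starts[0] != 0:
        raise ValueError("the first topic must start at position 0")
    if any(b <= a for a, b in zip(starts, starts[1:])) or starts[-1] >= len(slides):
        raise ValueError("topic_starts must be strictly increasing and in range")
    # Each topic spans from its start position up to the next topic's start.
    bounds = starts + [len(slides)]
    return [TopicSegment(slides=slides[a:b]) for a, b in zip(bounds, bounds[1:])]
```

With this layout, a query can be answered at either level: a topic segment gives a coarse time range in the video, and its `slides` list gives the finer slide-level ranges inside it.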