Cross-Modal Analysis of Audio-Visual Programs for Speaker Detection

Dongge Li*, Cuneyt Taskiran*, Nevenka Dimitrova, Wei Wang*, Mingkun Li, and Ishwar Sethi

*Multimedia Research Laboratory (MRL), Motorola Labs, 1301 E. Algonquin Rd., Schaumburg, IL 60196
Email: {dongge.li, cuneyt.taskiran, wei.wang}@motorola.com

Philips Research, 345 Scarborough Rd., Briarcliff Manor, NY 10510
Email: nevenka.dimitrova@philips.com

Intelligent Information Engineering Laboratory, Oakland University, Rochester, MI 48309
Email: {sethi, li}@oakland.edu

Abstract— This paper describes a speaker detection system using cross-modal association methods. Four association approaches are designed using linear and nonlinear association models. Speaker detection experiments were conducted to compare the approaches.

I. INTRODUCTION

Multimedia content usually contains two or more media streams that share semantic and/or temporal relationships. Examples of such media streams are the synchronized video and audio streams of audiovisual programs, a text and its spoken presentation, and images and the captions associated with them. In these examples, media streams of different modalities jointly contribute to the overall semantics of the multimedia content. In multimedia content analysis and indexing, the modalities are generally processed separately and the outputs of these unimodal systems are fused in a final combination stage. However, in this separation of modalities, valuable information is lost about the whole event and/or object that is to be analyzed and detected. Cross-modal multimedia processing systems that share information across all levels of processing will lead to synergistic integration of multiple modalities and thus will achieve better analysis and detection performance than systems that deal with the modalities separately. The cross-modal approach is also well supported biologically: cross-modal influences between different perceptions, such as visual, auditory, and olfactory inputs, occur at the earliest stages of sensory processing [2].

We refer to the task of employing cross-modality information analysis methods to identify and measure intrinsic associations between media streams of different modalities as cross-modal association. In this paper we propose several cross-modal association approaches based on both linear and nonlinear correlation models. Although the proposed approaches are applicable to any cross-modal association task, in this paper we consider only audiovisual signals with synchronized audio and video. The performance of the proposed approaches is compared on a video analysis task: speaker detection when more than one face is present in the video.

There have been several efforts to associate talking heads in video with speech in the audio stream. Slaney and Covell [7] propose FaceSync, an optimal linear detector that combines the information from all pixels to measure audio-visual synchronization. Fisher et al. [3] present a non-parametric approach to learning the joint distribution of audio and visual features. They first project the data into a maximally informative, low-dimensional subspace, and then model the stochastic relationships using a nonparametric density estimator. Li et al. [5] propose several cross-modal association approaches and compare their performance on retrieval and talking-head analysis tasks.
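To make the notion of linear cross-modal association concrete, the sketch below scores the association between two synchronized feature streams with canonical correlation analysis (CCA), a standard linear correlation model in this family. It is an illustrative baseline, not any of the four approaches evaluated in this paper; the inputs A (audio features) and V (video features) are assumed to be temporally aligned, one row per frame.

```python
import numpy as np

def cca_association(A, V, reg=1e-6):
    """Canonical correlations between two synchronized feature streams.

    A : (n_frames, d_audio) audio features, one row per frame
    V : (n_frames, d_video) video features, aligned with A
    Returns the canonical correlations in descending order.
    """
    A = A - A.mean(axis=0)                         # center each modality
    V = V - V.mean(axis=0)
    n = A.shape[0]

    Caa = A.T @ A / n + reg * np.eye(A.shape[1])   # within-modality covariances,
    Cvv = V.T @ V / n + reg * np.eye(V.shape[1])   # regularized for invertibility
    Cav = A.T @ V / n                              # cross-modality covariance

    # Whiten each modality; the singular values of the whitened
    # cross-covariance matrix are the canonical correlations.
    La = np.linalg.cholesky(Caa)
    Lv = np.linalg.cholesky(Cvv)
    K = np.linalg.solve(Lv, np.linalg.solve(La, Cav).T).T
    return np.clip(np.linalg.svd(K, compute_uv=False), 0.0, 1.0)
```

Applied to speaker detection, the leading canonical correlation can serve as a per-face score: the face region whose features correlate most strongly with the concurrent audio is declared the current speaker.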
Their results show that the cross-modal factor analysis method they propose has the best performance in both tasks. Cross-modal association problems have been examined for other modalities as well. Barnard et al. [1] propose learning models for the joint statistics of image components, such as segmented regions, and the keywords associated with the images. These models are then used for image retrieval and annotation.

The organization of this paper is as follows: In Section II a talking-head detection system based on linear and nonlinear association models is proposed and various cross-modal association approaches are described. Section III presents the experimental results for speaker detection. Conclusions and suggestions for future work are given in Section IV.

II. SPEAKER DETECTION USING CROSS-MODAL ASSOCIATION

Figure 1 shows the block diagram of the proposed talking-head analysis system. To detect the current speaker, candidate face regions are first located using a face detection module. Facial features are then extracted from each face region. In parallel, audio features are extracted from the audio stream; we use 12 Mel-Frequency Cepstral Coefficients (MFCCs) as the audio features. Audio classification is performed to determine the speech regions in the audio stream, using the algorithm described in [6]. Cross-modal association between audio and video is performed only when speech is detected.
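The paper treats face detection, facial feature extraction, and MFCC computation as front-end modules without fixing an implementation. As a rough sketch of such a front end, the snippet below computes 12 MFCCs per audio frame and locates candidate face regions; librosa and an OpenCV Haar cascade are stand-ins chosen here (not specified by the paper), and the hop length and sample rate are assumptions picked to align audio frames with a 25 fps video stream. The speech/non-speech classifier of [6] is not reproduced.

```python
import cv2
import librosa

def extract_mfcc_features(wav_path, sr=16000, hop_s=0.04):
    """12 MFCCs per audio frame; hop_s = 0.04 s aligns one audio
    frame with each frame of 25 fps video (an assumption)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12,
                                hop_length=int(sr * hop_s))
    return mfcc.T                                  # (n_frames, 12)

# Stock frontal-face Haar cascade shipped with OpenCV, standing in
# for the paper's face detection module.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_face_regions(frame_bgr):
    """Return candidate face bounding boxes (x, y, w, h) in one frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                         minNeighbors=5)
```

Features from each detected face region and the temporally aligned MFCC frames would then feed the cross-modal association stage, for example an association score such as the CCA sketch given earlier.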