Cross Entropy for Learning in Multimodal Streams

Athanasios K. Noulas, Nikos Vlassis, and Ben J.A. Kröse

University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands
{anoulas, vlassis, krose}@science.uva.nl
http://staff.science.uva.nl/{anoulas/, vlassis/, krose/}

Abstract. In this paper we present a variation of the Cross Entropy method that can be applied to Dynamic Bayesian Networks for efficient learning of the model parameters. We demonstrate the results achieved on real-world video streams using a variety of DBNs. Finally, we compare this approach to the traditional EM algorithm in terms of computational complexity, memory requirements, and robustness to initialization.

1 Introduction

In multimodal streams containing people talking, a very important task is person identification. Ideally, we would like to know which persons appear and who is speaking at every instant of the stream. This information can later be used for content extraction [13], speaker detection tasks [13], and intelligent summary creation [7]. Face and voice identification can be performed very robustly when training data are available. However, in contexts such as meeting rooms or news videos, we would like to perform this identification without the use of labeled examples. To achieve optimal results, both prior knowledge about the domain and the information in the temporal dimension of the data should be exploited. Furthermore, we should fuse the information coming from different modalities efficiently. This can be achieved in a probabilistic framework with the use of Dynamic Bayesian Networks (DBNs) [11]. DBNs can model complex relationships between variables and incorporate prior knowledge of the domain and the temporal dimension of the data. Furthermore, they are a well-defined framework for inferring the state of hidden variables. In our

Fig. 1. Example frames from our video sequence. The persons can appear simultaneously, the location of their faces may change rapidly, and some of their facial features may be occluded.
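As background for readers unfamiliar with it, the generic Cross Entropy method alternates between sampling candidate solutions from a parametric distribution, selecting an elite fraction by their score, and refitting the distribution to that elite set. The following is a minimal illustrative sketch of this generic loop on a toy continuous objective; the function name, Gaussian sampling distribution, and all parameter values are our own illustrative choices, not the paper's DBN-specific variant.

```python
import random

def cross_entropy_maximize(f, dim, n_samples=100, n_elite=10, n_iters=60):
    """Generic Cross Entropy optimization sketch (not the paper's algorithm).

    Maximizes f over R^dim using an independent Gaussian sampling
    distribution per coordinate.
    """
    random.seed(0)
    mu = [0.0] * dim      # mean of the sampling distribution
    sigma = [1.0] * dim   # std. dev. of the sampling distribution
    for _ in range(n_iters):
        # 1. Sample candidate solutions from the current distribution.
        samples = [[random.gauss(mu[d], sigma[d]) for d in range(dim)]
                   for _ in range(n_samples)]
        # 2. Keep the elite fraction with the highest objective values.
        samples.sort(key=f, reverse=True)
        elite = samples[:n_elite]
        # 3. Refit the distribution parameters to the elite samples.
        for d in range(dim):
            mu[d] = sum(x[d] for x in elite) / n_elite
            var = sum((x[d] - mu[d]) ** 2 for x in elite) / n_elite
            sigma[d] = var ** 0.5 + 1e-6  # small floor avoids collapse
    return mu

# Toy usage: maximize -||x - 3||^2, whose optimum is at (3, 3).
opt = cross_entropy_maximize(lambda x: -sum((xi - 3.0) ** 2 for xi in x), dim=2)
```

In the paper's setting the sampling distribution ranges over DBN parameters and the score is the data likelihood, but the sample-select-refit structure is the same.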