6 Surveillance Using Both Video and Audio

Yigithan Dedeoglu¹, B. Ugur Toreyin², Ugur Gudukbay¹, and A. Enis Cetin²

¹ Department of Computer Engineering, Bilkent University
² Department of Electrical and Electronics Engineering, Bilkent University

It is now possible to install cameras to monitor sensitive areas, but it may not be possible to assign a security guard to each camera or set of cameras. In addition, security guards may get tired and watch the monitor blankly, without noticing important events taking place in front of their eyes. Current CCTV surveillance systems are mostly based on video, and intelligent video analysis systems capable of detecting humans and cars have recently been developed for surveillance applications. Such systems mostly use Hidden Markov Models (HMM) or Support Vector Machines (SVM) to reach decisions. They detect important events, but they also produce false alarms. It is possible to take advantage of other low-cost sensors, including audio, to reduce the number of false alarms. Most video recording systems are capable of recording audio as well. Analysis of audio for intelligent information extraction is a relatively new area. Broken glass sounds, car crash sounds, screams, and an increasing background sound level are indicators of important events, and they can be detected automatically. By combining the information from the audio channel with the information from the video channels, reliable surveillance systems can be built. In this chapter, the current state of the art is reviewed and an intelligent surveillance system analyzing both audio and video channels is described.

6.1 Multimodal Methods for Surveillance

Multimodal methods have been successfully utilized in the literature to improve the accuracy of automatic speech recognition, human activity recognition, and tracking systems [355, 595]. Multimodal surveillance techniques are discussed in a recent edited book by Zhu and Huang [595].
In [597], signals from an array of microphones and a video camera installed in a room are analyzed using a Bayesian approach for human tracking. A speech recognition system comprising a visual as well as an audio processing unit is proposed in [147]. A recent patent proposes a coupled hidden Markov model based method for audiovisual speech recognition [361].

P. Maragos et al. (eds.), Multimodal Processing and Interaction, DOI: 10.1007/978-0-387-76316-3_6, © Springer Science+Business Media, LLC 2008