6 Surveillance Using Both Video and Audio
Yigithan Dedeoglu¹, B. Ugur Toreyin², Ugur Gudukbay¹, and A. Enis Cetin²
¹ Department of Computer Engineering, Bilkent University
² Department of Electrical and Electronics Engineering, Bilkent University
It is now possible to install cameras to monitor sensitive areas, but it may
not be possible to assign a security guard to each camera or set of cameras.
In addition, security guards may get tired and watch the monitor blankly,
without noticing important events taking place in front of their eyes. Current
CCTV surveillance systems are mostly based on video, and intelligent video
analysis systems capable of detecting humans and cars have recently been
developed for surveillance applications. Such systems mostly use Hidden Markov
Models (HMM) or Support Vector Machines (SVM) to reach decisions. They detect
important events, but they also produce false alarms. Other low-cost sensors,
including audio, can be used to reduce the number of false alarms, and most
video recording systems can record audio as well. Analysis of audio for
intelligent information extraction is a relatively new area. Broken glass
sounds, car crash sounds, screams, and a rising background sound level are
indicators of important events that can be detected automatically. By
combining the information coming from the audio channel with the information
from the video channels, reliable surveillance systems can be built. In this
chapter, the current state of the art is reviewed and an intelligent
surveillance system that analyzes both audio and video channels is described.
6.1 Multimodal Methods for Surveillance
Multimodal methods have been successfully utilized in the literature to
improve the accuracy of automatic speech recognition, human activity
recognition, and tracking systems [355, 595]. Multimodal surveillance
techniques are discussed in a recent edited book by Zhu and Huang [595]. In
[597], signals from an array of microphones and a video camera installed in a
room are analyzed using a Bayesian approach for human tracking. A speech
recognition system comprising a visual as well as an audio processing unit is
proposed in [147]. A recent patent proposes a coupled hidden Markov model
based method for audiovisual speech recognition [361].
P. Maragos et al. (eds.), Multimodal Processing and Interaction,
DOI: 10.1007/978-0-387-76316-3_6, © Springer Science+Business Media, LLC 2008