Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2009, Article ID 807162, 12 pages
doi:10.1155/2009/807162
Research Article
Exploiting Temporal Feature Integration for Generalized Sound Recognition
Stavros Ntalampiras,¹ Ilyas Potamitis (EURASIP Member),² and Nikos Fakotakis¹

¹Electrical and Computer Engineering Department, University of Patras, 26500 Rio-Patras, Greece
²Department of Music Technology and Acoustics, Technological Educational Institute of Crete, Daskalaki-Perivolia, Crete 74100, Greece
Correspondence should be addressed to Stavros Ntalampiras, sntalampiras@upatras.gr
Received 13 July 2009; Revised 25 September 2009; Accepted 18 November 2009
Recommended by Douglas O’Shaughnessy
This paper presents a methodology that incorporates temporal feature integration for automated generalized sound recognition.
Such a system can be of great use for scene analysis and understanding based on the acoustic modality. The performance of three
feature sets based on the Mel filterbank, the MPEG-7 audio protocol, and wavelet decomposition is assessed. Furthermore, we explore
the application of temporal integration using three different strategies: (a) short-term statistics, (b) spectral moments,
and (c) autoregressive models. The experimental setup, which is based on the concurrent usage of professional sound effects
collections, is thoroughly explained; in this way we aim to form a representative picture of the characteristics of ten sound classes.
During the first phase of our implementation, audio classification is achieved through statistical models (HMMs), while a fusion
scheme that exploits the models constructed from the various feature sets provides the highest average recognition rate. The proposed
system not only uses diverse groups of sound parameters but also exploits the advantages of temporal feature integration.
Copyright © 2009 Stavros Ntalampiras et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
1. Introduction
Humans have the ability to detect and recognize a sound
event quite effortlessly. Moreover, we can concentrate on a
particular sound event, isolating it from background noise,
for example, focus on a conversation while loud music is
playing. During the last decades, emphasis has been placed
upon methods for automated speech/speaker recognition.
This is due to the fact that speech plays an important role
with regard to both human-human and human-machine interactions.
While this area has reached the maturity of launching
commercial products, the area of nonspeech audio processing
still needs attention, since it has the potential to provide
solutions to a variety of applications. The domain
of audio recognition is currently dominated by techniques
which are mainly applied to speech technology [1]. This fact
is based on the assumption that all audio streams can be
processed in a common manner, even if they are emitted
by different sources. In general, the goal of generalized
audio recognition technology is the construction of a system
that can efficiently recognize its surrounding environment
by solely exploiting the acoustic modality (computational
auditory scene analysis [2]). Every sound source exhibits a
consistent acoustic pattern, which results in a specific way of
distributing its energy across its frequency content. This unique
pattern can be discovered and modeled by utilizing statistical
pattern recognition algorithms. However, there exists a variety
of obstacles that need to be tackled when such a system
operates under real-world conditions. When a large number
of different sound classes must be handled, the recognition
performance decreases. Moreover, the categorization
of sounds into distinct classes is sometimes ambiguous (an
audio category may overlap with another) while composite
real-world sound scenes can be very difficult to analyze. This
fact has led to solutions that target specific problems, while
a generic system remains an open research subject.
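To make the modeling idea above concrete, the simplest temporal feature integration strategy evaluated in this paper, short-term statistics, summarizes a sequence of frame-level features (e.g., Mel filterbank coefficients) by their mean and standard deviation over a longer texture window. The sketch below is an illustrative implementation under assumed window/hop sizes, not the authors' exact configuration:

```python
import numpy as np

def integrate_short_term(frames, win=30, hop=15):
    """Temporal feature integration via short-term statistics.

    frames: (n_frames, n_features) array of frame-level features
            (e.g., MFCCs). win/hop are texture-window length and
    shift in frames (hypothetical values for illustration).
    Returns one mean+std vector per texture window.
    """
    out = []
    for start in range(0, frames.shape[0] - win + 1, hop):
        chunk = frames[start:start + win]
        # Concatenate per-dimension mean and standard deviation
        out.append(np.concatenate([chunk.mean(axis=0), chunk.std(axis=0)]))
    return np.array(out)

# Toy example: 100 frames of 13-dimensional features
rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 13))
texture = integrate_short_term(feats)
print(texture.shape)  # (5, 26): 5 texture windows, mean+std of 13 dims
```

The resulting texture-window vectors, rather than the raw frames, are what a classifier such as an HMM would then model, which is the sense in which integration captures temporal behavior beyond a single analysis frame.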
Lately, generic audio classification technology has been
used for the needs of several emerging real-world applications,
such as environmental monitoring, bioacoustic identification,
acoustic surveillance, applications to music, context
awareness by robots, and so forth [3–8]. The purpose of
this work is the extensive evaluation of sound parameters of