Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2009, Article ID 807162, 12 pages
doi:10.1155/2009/807162
Research Article
Exploiting Temporal Feature Integration for Generalized Sound Recognition
Stavros Ntalampiras,¹ Ilyas Potamitis (EURASIP Member),² and Nikos Fakotakis¹

¹Electrical and Computer Engineering Department, University of Patras, 26500 Rio-Patras, Greece
²Department of Music Technology and Acoustics, Technological Educational Institute of Crete, Daskalaki-Perivolia, Crete 74100, Greece
Correspondence should be addressed to Stavros Ntalampiras, sntalampiras@upatras.gr
Received 13 July 2009; Revised 25 September 2009; Accepted 18 November 2009
Recommended by Douglas O’Shaughnessy
This paper presents a methodology that incorporates temporal feature integration for automated generalized sound recognition.
Such a system can be of great use for scene analysis and understanding based on the acoustic modality. The performance of three
feature sets based on the Mel filterbank, the MPEG-7 audio protocol, and wavelet decomposition is assessed. Furthermore, we explore
the application of temporal integration using three different strategies: (a) short-term statistics, (b) spectral moments,
and (c) autoregressive models. The experimental setup, which is based on the concurrent usage of professional sound effects
collections, is thoroughly explained; in this way we aim to form a representative picture of the characteristics of ten sound classes.
During the first phase of our implementation, audio classification is achieved through statistical models (HMMs), while a fusion
scheme that exploits the models constructed from the various feature sets provides the highest average recognition rate. The proposed
system not only uses diverse groups of sound parameters but also exploits the advantages of temporal feature integration.
Copyright © 2009 Stavros Ntalampiras et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
1. Introduction
Humans have the ability to detect and recognize a sound
event quite effortlessly. Moreover, we can concentrate on a
particular sound event, isolating it from background noise,
for example, focus on a conversation while loud music is
playing. During the last decades, emphasis has been placed
upon methods for automated speech/speaker recognition.
This is due to the fact that speech plays an important role
with regard to both human-human and human-machine interactions.
While this area has reached the maturity of launching
commercial products, the area of nonspeech audio processing
still needs attention, since it has the potential to provide
solutions to a variety of applications. The domain
of audio recognition is currently dominated by techniques
which are mainly applied to speech technology [1]. This fact
is based on the assumption that all audio streams can be
processed in a common manner, even if they are emitted
by different sources. In general, the goal of generalized
audio recognition technology is the construction of a system
that can efficiently recognize its surrounding environment
by solely exploiting the acoustic modality (computational
auditory scene analysis [2]). Every sound source exhibits a
consistent acoustic pattern, which results in a specific way of
distributing its energy across its frequency content. This unique
pattern can be discovered and modeled by utilizing statistical
pattern recognition algorithms. However, there exists a variety
of obstacles that need to be tackled when such a system
operates under real-world conditions. When a large number
of different sound classes must be handled, the recognition
performance decreases. Moreover, the categorization
of sounds into distinct classes is sometimes ambiguous (an
audio category may overlap with another) while composite
real-world sound scenes can be very difficult to analyze. This
fact has led to solutions that target specific problems, while
a generic system remains an open research subject.
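To make the modeling idea above concrete, the simplest temporal feature integration strategy evaluated in this paper, short-term statistics, summarizes a sequence of frame-level features (e.g., Mel filterbank coefficients) by their mean and standard deviation over a longer texture window. The sketch below is an illustrative implementation under assumed window/hop sizes, not the authors' exact configuration:

```python
import numpy as np

def integrate_short_term(frames, win=30, hop=15):
    """Temporal feature integration via short-term statistics.

    frames: (n_frames, n_features) array of frame-level features
            (e.g., MFCCs). win/hop are texture-window length and
    shift in frames (hypothetical values for illustration).
    Returns one mean+std vector per texture window.
    """
    out = []
    for start in range(0, frames.shape[0] - win + 1, hop):
        chunk = frames[start:start + win]
        # Concatenate per-dimension mean and standard deviation
        out.append(np.concatenate([chunk.mean(axis=0), chunk.std(axis=0)]))
    return np.array(out)

# Toy example: 100 frames of 13-dimensional features
rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 13))
texture = integrate_short_term(feats)
print(texture.shape)  # (5, 26): 5 texture windows, mean+std of 13 dims
```

The resulting texture-window vectors, rather than the raw frames, are what a classifier such as an HMM would then model, which is the sense in which integration captures temporal behavior beyond a single analysis frame.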
Lately, generic audio classification technology has been
used for the needs of several emerging real-world applications,
such as environmental monitoring, bioacoustic identification,
acoustic surveillance, applications to music, context
awareness by robots, and so forth [3–8]. The purpose of
this work is the extensive evaluation of sound parameters of