AUTOMATIC RECOGNITION OF URBAN ENVIRONMENTAL SOUNDS EVENTS

Stavros Ntalampiras †, Ilyas Potamitis ±, Nikos Fakotakis †

† Wire Communications Laboratory, University of Patras, {dallas, fakotaki}@wcl.ee.upatras.gr
± Department of Music Technology and Acoustics, Technological Educational Institute of Crete, potamitis@stef.teicrete.gr

ABSTRACT

Computer audition is an evolving and relatively new research field with many new applications. It would be of great convenience to live in an environment that can change automatically based on its “auditory sense”. In this work we propose a novel framework for automatic recognition of urban soundscenes. Our system employs a hierarchical classification scheme, and the performance of two well-known feature sets is compared. A new post-processing algorithm that enhances the discrimination quality of MPEG-7 features is proposed and shown to provide improved results. Our approach is evaluated using a compact testing procedure, in which MPEG-7 low-level descriptors (LLDs) reach higher recognition rates than MFCCs.

Index Terms: Computer Audition, Environmental sound recognition, MFCC, MPEG-7, Hidden Markov Models (HMM)

1. INTRODUCTION

Nowadays we experience many different types of urban sounds in our everyday life (car, motorcycle, crowd, etc.). Humans can differentiate them quite effortlessly utilizing only the auditory sense. Consider, as an example, the situation where one is waiting at a traffic light. Using incoming sounds alone, one is able to understand that a car is passing and a dog is barking in the presence of a horn sound. The general scope of our work is to build a system that has the ability to automatically “understand” its surrounding environment by taking into consideration only the sounds it “hears”. The area of computer audition faces an increasing demand in numerous applications (robotic awareness, environmental monitoring, media annotation, etc.), thus becoming a research field of great importance.
Over the past decades a great deal of work has been published in the area of content-based audio classification. Eronen et al. [1] explore an audio-based recognition system used for classification of 24 urban contexts. They utilize several simple low-dimensional features as well as standard spectral descriptors along with an HMM-based classification scheme, achieving a 58% recognition rate. A framework for frame-level classification of noises belonging to five categories is presented in [2]. Line Spectral Frequencies (LSFs) in combination with a decision tree classifier were employed, resulting in 88.1% classification accuracy. A method based on three MPEG-7 audio low-level descriptors (spectrum centroid, spectrum spread and spectrum flatness) is presented in [3]. For the classification scheme, a fusion of support vector machines and the k-nearest-neighbour rule is adopted in order to assign a specific sound to predefined classes of common home environmental sounds.

In this work we employ MFCCs and MPEG-7 descriptors. Our aim is to identify which feature set contains more discriminative information for the task of recognizing urban soundscenes.

The rest of the paper is organized as follows. The next three sections describe the overall architecture of our implementation, the feature extraction methodology and the recognition procedure. A detailed analysis of the evaluation method and of the results is given in the last part of the paper.

2. SYSTEM OVERVIEW

In this section we describe the architecture of our system, which makes automatic recognition of urban soundscenes possible. We approach the issue based on the way humans subconsciously categorize their surrounding environment using only the perceived acoustic information. Our goal is to distinguish scenes belonging to eight different classes: aircraft, motorcycle, car, crowd, thunder, wind, train, horn.
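As a rough illustration of the kind of frame-level spectral descriptors mentioned in the related work (spectrum centroid, spread and flatness), the following sketch computes simplified versions of them with NumPy. It is only an assumption-laden example: the function name `spectral_descriptors`, the Hann window and the linear-frequency definitions are our own choices for illustration, and the code does not reproduce the MPEG-7 standard definitions (which, for instance, work on a logarithmic frequency scale and compute flatness per band), nor the feature extraction used in this paper.

```python
import numpy as np


def spectral_descriptors(frame, sample_rate):
    """Return (centroid_hz, spread_hz, flatness) for one audio frame.

    Simplified, linear-frequency definitions for illustration only;
    these are NOT the MPEG-7 conformant descriptors.
    """
    frame = np.asarray(frame, dtype=float)
    # Taper the frame to limit spectral leakage (window choice is an assumption).
    windowed = frame * np.hanning(len(frame))
    # One-sided power spectrum and the matching frequency axis in Hz.
    power = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)

    eps = 1e-12  # guards against division by zero and log(0)
    total = power.sum() + eps

    # Centroid: power-weighted mean frequency.
    centroid = (freqs * power).sum() / total
    # Spread: power-weighted standard deviation around the centroid.
    spread = np.sqrt((((freqs - centroid) ** 2) * power).sum() / total)
    # Flatness: geometric mean over arithmetic mean of the power spectrum.
    flatness = np.exp(np.mean(np.log(power + eps))) / (np.mean(power) + eps)
    return centroid, spread, flatness


if __name__ == "__main__":
    sr = 16000
    t = np.arange(512) / sr
    tone = np.sin(2 * np.pi * 1000.0 * t)                 # narrow-band signal
    noise = np.random.default_rng(0).standard_normal(512)  # broadband signal
    print("tone :", spectral_descriptors(tone, sr))
    print("noise:", spectral_descriptors(noise, sr))
```

In this sketch a pure tone should yield a centroid close to its frequency, a small spread and a flatness near zero, whereas broadband noise should yield a markedly higher flatness; this is the sort of contrast such descriptors are meant to capture between tonal sounds (e.g. a horn) and noise-like sounds (e.g. wind).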
We utilize a hierarchical classification scheme consisting of two stages and derived from a perceptual point of view of