Computational auditory scene analysis: listening to several things at once

Martin Cooke, Guy J. Brown, Malcolm Crawford and Phil Green

The problem of distinguishing particular sounds, such as conversation, against a background of irrelevant noise is a matter of common experience. Psychologists have studied it for some 40 years, but it is only comparatively recently that computer modelling of the phenomenon has been attempted. This article reviews the progress made, possible practical applications, and prospects for the future.

Martin Cooke, B.Sc., Ph.D. Studied Computer Science and Mathematics at Manchester University, and received a doctorate in Computer Science from the University of Sheffield, where he is currently a lecturer in computer science. He has been active in speech and hearing research since 1962, and his research interests include speech segregation, speech coding and developmental speech synthesis.

Guy J. Brown, B.Sc., Ph.D. A graduate of Sheffield Hallam University, he obtained a doctorate in Computer Science from the University of Sheffield in 1992, where he is now a lecturer in computer science. He has studied computational models of hearing since 1969, and also has research interests in music perception and virtual reality.

Malcolm Crawford, B.Sc. Graduated in psychology at the University of Sheffield. He is currently a Research Associate working on object-oriented blackboard architectures for auditory scene analysis.

Phil Green, B.Sc., Ph.D. Graduated in Cybernetics and Instrument Physics from the University of Reading in 1967, and obtained a doctorate from the University of Keele in 1971. He is the founder of the speech and hearing research group at the University of Sheffield, where he is currently a Senior Lecturer in the Department of Computer Science. His research interests include the combination of symbolic, statistical and connectionist models in automatic speech recognition.

In most listening situations, a mixture of sounds reaches our ears. For example, at a crowded party there are many competing voices and other interfering noises, such as music. Similarly, the sound of an orchestra consists of a number of melodic lines played by a variety of instruments. Nonetheless, we are able to attend to a particular voice or a particular instrument in these situations. How does the ear achieve this apparently effortless segregation of concurrent sounds?

E.C. Cherry [1] noted this phenomenon in 1953, and called it the 'cocktail party problem'. Since then, the perceptual segregation of sound has been the subject of extensive psychological research. Recently, a thorough account of this work has been presented by A.S. Bregman [2]. He contends that the mixture of sounds reaching the ears is subjected to a two-stage auditory scene analysis (ASA). In the first stage, the acoustic signal is decomposed into a number of 'sensory components'. Subsequently, components which are likely to have arisen from the same environmental event are recombined into perceptual structures that can be interpreted by higher-level processes.

Although ASA is documented comprehensively in the literature, there have been few attempts to investigate the phenomenon with a computer model. In this article, we describe progress on a model of auditory processing which is able to simulate some aspects of ASA.
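By way of a rough illustration of this two-stage account (a minimal sketch with hypothetical names, fields, and tolerances, not the model described below), the second stage might group components that start at roughly the same time, on the assumption that simultaneous onsets signal a common source:

```python
# A minimal sketch of two-stage auditory scene analysis: stage 1 decomposes
# the signal into time-frequency 'sensory components'; stage 2 regroups
# components likely to share an environmental source. All names, fields and
# tolerances here are hypothetical, not those of the model described below.

from dataclasses import dataclass

@dataclass
class Component:
    onset: float       # start time in seconds
    frequency: float   # centre frequency in Hz

def group_by_onset(components, tolerance=0.02):
    """Group components whose onsets fall within `tolerance` seconds of the
    first member of a group; simultaneous onsets suggest a common source."""
    groups = []
    for c in sorted(components, key=lambda c: c.onset):
        if groups and c.onset - groups[-1][0].onset < tolerance:
            groups[-1].append(c)
        else:
            groups.append([c])
    return groups

# Two events: a complex tone starting near t = 0.10 s, and a second sound at t = 0.50 s.
strands = [Component(0.100, 200.0), Component(0.105, 400.0),
           Component(0.102, 600.0), Component(0.500, 1000.0)]
print([len(g) for g in group_by_onset(strands)])   # -> [3, 1]
```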
The model characterises an acoustic signal as a collection of time-frequency components, which we call synchrony strands [3], and then searches the auditory scene in order to identify components with common properties.

While our modelling studies have their own intrinsic scientific merits, they are also motivated by a number of possible applications. Firstly, the performance of automatic speech recognition (ASR) systems is poor in the presence of background noise. In contrast, human listeners with normal hearing are quite capable of following a conversation in a noisy environment. This suggests that models of auditory processing could provide a robust front-end for ASR systems. A related point is that human listeners with impaired hearing have difficulty in understanding speech in noisy environments. These listeners generally have neural defects of the cochlea, and are not helped by conventional hearing aids, which simply amplify the speech and background noise together. A better solution would be an 'intelligent' hearing aid able to attenuate noises, echoes, and the sounds of competing talkers, while amplifying a target voice. A model of ASA could form the basis for such a hearing aid.

Other applications of this work lie in the field of music processing. An example is the transcription of recorded polyphonic music, for which it is necessary to identify how many notes are being played at a particular time, and to which instruments they belong. A model of ASA could provide the basis for an automatic transcription system by performing this segregation. Such a system could be faster and more accurate than manual techniques, and would provide an efficient means of transcribing recorded music which is not notated (e.g. much folk, ethnic, and popular music). A transcription system would also provide feedback in music teaching, allowing a player to compare a transcription of his performance with the original score. Some early work on this is reported by G.J. Brown and M.P. Cooke [4].

Auditory scene analysis

In his book, Bregman [2] makes a distinction between two types of perceptual grouping: primitive grouping and schema-driven grouping. Primitive grouping is driven by the incoming acoustic data, and is probably innate. In contrast, schema-driven grouping employs knowledge of familiar patterns and concepts that has been acquired through experience of acoustic environments.

Many primitive grouping principles can be described by the Gestalt principles of perceptual organisation. The Gestalt psychologists (e.g. K. Koffka [5]) proposed a number of rules governing the manner in which the brain forms mental patterns from elements of its sensory input. Although these principles were generally described first in relation to vision, they are equally applicable to audition. A potent Gestalt principle is common fate, which states that elements changing in the same way at the same time probably belong together. There is good evidence that the auditory system exploits common fate by grouping acoustic components that exhibit changes in amplitude at the same time.

Similarly, grouping by harmonicity can be phrased in terms of the Gestalt principle of common fate. When a person speaks, vibration of the vocal cords generates energy at the fundamental frequency of vibration and also at integer multiples (harmonics) of this frequency. Hence, the components of a single voice can be grouped by identifying acoustic components that have a common spacing in frequency.
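As a rough illustration of grouping by harmonicity (a minimal sketch; the function name, tolerance, and example frequencies are our own illustrative assumptions rather than part of the published model), one can collect the components whose frequencies lie near integer multiples of a candidate fundamental:

```python
# A minimal sketch of grouping by harmonicity: keep the components whose
# frequencies lie near an integer multiple of a candidate fundamental f0.
# The function name, tolerance and example frequencies are illustrative
# assumptions, not part of the published model.

def harmonic_components(f0, component_freqs, tolerance=0.03):
    """Return the frequencies (Hz) that sit within `tolerance` (relative
    deviation) of an integer multiple of the candidate fundamental f0."""
    grouped = []
    for f in component_freqs:
        n = max(1, round(f / f0))          # nearest harmonic number
        if abs(f - n * f0) <= tolerance * n * f0:
            grouped.append(f)
    return grouped

# Components from a voice with f0 = 120 Hz, mixed with an unrelated 500 Hz tone.
print(harmonic_components(120.0, [120.0, 241.0, 363.0, 500.0, 480.0]))
# -> [120.0, 241.0, 363.0, 480.0]; the 500 Hz component is left ungrouped.
```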
We now describe the processing carried out by the peripheral auditory system, and a computer model which simulates some aspects of auditory scene analysis.

Auditory representations

The peripheral auditory system - consisting of the outer, middle, and inner ear - serves to transform acoustic energy into a neural