BLIND SEPARATION OF SPEAKERS IN NOISY REVERBERANT ENVIRONMENTS: A NEURAL NETWORK APPROACH

A. Koutras, E. Dermatas and G. Kokkinakis
WCL, Electrical and Computer Engineering Dept, University of Patras, 26100 Patras, HELLAS.
e-mail: koutras@giapi.wcl2.ee.upatras.gr

Abstract

In this paper we present neural network solutions to the Blind Signal Separation (BSS) problem for simultaneous speech signals in reverberant, noisy rooms. The separation networks used are feedforward and recurrent neural networks, along with a proposed hybrid network. These networks separate convolutive speech mixtures in the time domain, without any prior knowledge of the propagation media, based on the Maximum Likelihood Estimation (MLE) criterion. The proposed separation networks improve the Signal-to-Interference Ratio (SIR) by more than 30 dB in a two-simultaneous-speaker environment, even in the presence of a noise source (more than 15 dB improvement at 0 dB SNR). In addition, the recognition accuracy of a continuous phoneme-based speech recognition system improved by more than 20% in all adverse mixing situations with high interference from competing speakers and noise. The proposed separation networks can therefore be used as a front-end processor for continuous speech recognition of simultaneous speakers in real reverberant rooms.

1 Introduction

The problem of Blind Signal Separation (BSS) consists of recovering unknown signals, or "sources", from their observed mixtures. Typically, these mixtures are acquired by a number of sensors, where each sensor receives a different combination of the source signals. The term "blind" reflects the fact that the only a priori knowledge available about the signals is their statistical independence; no information about the mixing model parameters or the transfer paths from the sources to the sensors is available beforehand.
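To make the MLE-based separation idea concrete, the sketch below applies the natural-gradient maximum-likelihood (InfoMax) update to the simplest, instantaneous case: an unmixing matrix W is adapted so that the outputs y = Wx become statistically independent, using tanh as the score function appropriate for super-Gaussian (speech-like) sources. This is a minimal illustration of the criterion, not the paper's time-domain convolutive networks; the mixing matrix, learning rate, and Laplacian surrogate sources are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent super-Gaussian "speech-like" surrogate sources (Laplacian).
n = 20000
s = rng.laplace(size=(2, n))

# Hypothetical instantaneous mixing matrix (unknown to the algorithm).
A = np.array([[1.0, 0.6],
              [0.5, 1.0]])
x = A @ s  # observed sensor signals

# Natural-gradient MLE/InfoMax rule:
#   W <- W + eta * (I - phi(y) y^T / n) W,   phi(y) = tanh(y),
# which drives the outputs toward statistical independence.
W = np.eye(2)
eta = 0.1
for _ in range(200):
    y = W @ x
    W += eta * (np.eye(2) - np.tanh(y) @ y.T / n) @ W

y = W @ x  # separated estimates, recovered up to permutation and scaling
```

At convergence the global system G = WA approaches a scaled permutation matrix, i.e. each output contains essentially one source. The paper's networks extend this principle to convolutive mixtures by replacing the scalar weights with FIR filters operating directly in the time domain.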
There are many potential applications of BSS in science and technology, particularly in wireless communication [1], noninvasive medical diagnosis using EEG, MEG and ECG signals [2], geophysical exploration, image enhancement and recognition, and speech processing. Acoustic examples include the separation of signals acquired by several microphones in a sound field produced by several speakers, possibly in the presence of noise. This situation is known as the "cocktail party problem" [3-8]. A great number of algorithms for BSS of speech signals have been proposed, originating not only from traditional signal processing theory but also from various other backgrounds such as neural networks, information theory, statistics, system theory and information geometry [9,10]. However, most of them deal with the instantaneous mixture of sources, and only a few methods address convolutive mixtures of speech signals.

The instantaneous mixture is the simplest case of BSS and arises only when multiple speakers talk simultaneously in an anechoic room. When dealing with real room acoustics, one has to consider the echoes and delayed versions of the speech signals as well. This is the most frequently encountered situation in the real world, where speech signals from multiple speakers are received by a number of microphones located in the room. Each microphone acquires speech from all speakers, consisting of several delayed and modified copies of the original sound sources that reflect off walls and objects in the room. Depending on the amount and type of room noise, the strength of the echoes and the amount of reverberation, the speech signals received by the microphones may be highly distorted and severely degrade the performance of any speech recognition system.
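The distinction between the two mixing models can be sketched as follows: in the instantaneous (anechoic) case each microphone observes a weighted sum of the sources, whereas in the convolutive (reverberant) case each source reaches each microphone through a room impulse response, so the observation is a sum of filtered, delayed source copies. The 3-tap FIR filters below are illustrative stand-ins, not measured room responses.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
s1, s2 = rng.laplace(size=n), rng.laplace(size=n)  # surrogate speech sources

# Instantaneous mixing (anechoic room): scalar gains only.
x1_inst = 1.0 * s1 + 0.6 * s2
x2_inst = 0.5 * s1 + 1.0 * s2

# Convolutive mixing (reverberant room): each source-to-microphone path is an
# FIR filter modelling the direct path plus attenuated, delayed echoes.
h11 = np.array([1.0, 0.3, 0.1])   # source 1 -> mic 1
h12 = np.array([0.6, 0.2, 0.05])  # source 2 -> mic 1
h21 = np.array([0.5, 0.25, 0.1])  # source 1 -> mic 2
h22 = np.array([1.0, 0.35, 0.1])  # source 2 -> mic 2

x1_conv = np.convolve(s1, h11)[:n] + np.convolve(s2, h12)[:n]
x2_conv = np.convolve(s1, h21)[:n] + np.convolve(s2, h22)[:n]
```

A separation network for the convolutive case must therefore invert a matrix of filters rather than a matrix of scalars, which is why the instantaneous algorithms mentioned above do not carry over directly to real rooms.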
To this end, BSS techniques must be used as a front end to separate the convolutive mixtures of the speech signals and improve the recognition accuracy of ASR systems. Blind signal separation for improving the speech recognition rate in real reverberant environments has mostly been tested in the case where each speaker is positioned near a microphone [6-8] in a noise-free environment with multiple simultaneous speakers. Moreover, in most noisy scenarios, noise is considered to be additive in the acquired speech signals [6]. The truly challenging task arises when the speakers and the microphones are positioned arbitrarily in a room and, furthermore, when the noise sources are not located near the microphones but are arbitrarily positioned in the room as well. Work that deals with the problem of speech