Kari Torkkola
Motorola, Phoenix Corporate Research Labs
2100 East Elliot Road, MD EL508, Tempe, AZ 85284, USA
a540aa@email.mot.com
http://members.home.net/torkkola

We attempt to give an overview of current research in blind separation of convolutive mixtures of signals, concentrating on audio signals and on methods applicable to them. We briefly enumerate some application areas and present two possible taxonomies of separation methods, one based on system parametrizations, the other on the different criteria used to solve the problem. We wade through the literature following these taxonomies. We also discuss what might yet be missing in the current research.

We presuppose familiarity with the concepts of Independent Component Analysis (ICA) and Blind Signal Separation (BSS). If this is not the case, the reader is advised to refer to [1, 8, 15, 20, 37].

BSS seemingly has a large number of potential applications in the audio realm. The generic application is, of course, separation of simultaneous audio sources in a reverberant or echoing environment, that is, in a natural environment such as the inside of a room. We will enumerate here only a few actual applications; the reader is free to use her imagination to develop more.

A very desirable application area would be signal enhancement by removing noise or other unwanted signal components using blind separation methods, as in [31], for example. In this area only one signal is of interest; the rest is considered a nuisance. Enhancement of voice quality in mobile phones would be one important application, especially in car environments. Since voice coders used in cell phones are optimized for coding speech alone, the combination of excessive noise with a speech signal results in poor sound quality. Some initial experiments in this area can be found in [78]. Making voice dialling, or speech recognition in general, more viable in noisy environments would fall into the same category [115, 48].
Spying, intelligence, and forensic applications also fall under the same category, where the interest might be in picking up one important signal amongst others [69].

In audio communications, transparency refers to reproduced audio being ideally free from reverberation, noise, acoustical echoes, and other mixed-in speakers [82]. Teleconferencing and speakerphones are two areas where speech signal acquisition with transparency is desirable. Combining existing multi-channel acoustical echo cancellation technology with BSS has been shown to be useful in a teleconferencing setup [82]. Hearing aids are another lucrative application area for speech enhancement through BSS.

Whether some of the above can yet be a viable and profitable application area is an open question and will be touched upon in the concluding section. Current limitations in the methods might render some applications, if not impossible, at least impractical.

Besides audio, an extremely fruitful application arena is digital communications. While the basic concept remains the same (multiple transmitters at the same frequency, multiple antennas receiving multiple mixtures, for example), there are a few important differences. Signals in this context are man-made, and thus their properties are completely known in advance. This can (and should) be exploited in devising separation methods. Another difference is that signals could be transmitted in short bursts, which might call for block-based algebraic methods rather than adaptive methods [102]. The paper at hand, however, concentrates on methods applicable to audio.

The main contribution of the paper is an attempt to collect a major part of the literature pertinent to BSS and audio, and to present that literature in light of two possible taxonomies of separation methods. These are based on how to parametrize the separating structure, and on the criteria used for separation.
Due to the breadth of the literature and page limitations, the presentation cannot but be superficial. We also discuss what the technology might still lack to produce successful applications.

Any method for separation of convolutive mixtures can be roughly divided into three essential components: 1) the parametrization of the separation system (filters or matrices), 2) the separation criterion, and 3) the method used to optimize the chosen criterion. We concentrate on the first two components, and mention only briefly that the optimization methods can be coarsely divided into adaptive and algebraic approaches. The former category can further be subdivided into stochastic gradient type algorithms (with or without second-order information, as in Newton's method), and function zero search algorithms or fixed point methods. The latter category mainly consists of methods to jointly and/or approximately diagonalize a number of matrices.

In this section we discuss what alternatives exist for the parametrization of the separating system and, if the method at hand requires it, for the parametrization of the source signals.

Most real convolutive mixing scenarios with audio can be modeled as a feedforward mixing network having FIR filters in its branches. A room with multiple simultaneous sound sources and multiple microphones is an example, where the mixing filters are the room impulse responses between each source and each microphone.

The separation system, ideally inverting the effect of the mixing system, can also be modeled as a feedforward network of FIR filters that approximate the required inverse filters. Consider a simple 2x2 mixing case in the z-domain: