MAXIMUM NEGENTROPY BEAMFORMING WITH SUPERDIRECTIVITY

Kenichi Kumatani 1, Liang Lu 2, John McDonough 1, Arnab Ghoshal 1 and Dietrich Klakow 1

1 Spoken Language Systems at Saarland University in Saarbrücken, Germany
2 The Centre for Speech Technology Research at University of Edinburgh in Edinburgh, United Kingdom
web: http://distant-automatic-speech-recognition.org

ABSTRACT

This paper presents new superdirective beamforming algorithms based on the maximum negentropy (MN) criterion for distant automatic speech recognition. The MN beamformer is configured in the generalized sidelobe canceller structure, and uses the weights derived from a delay-and-sum beamformer as the quiescent weight vector. While satisfying the distortionless constraint in the look direction, it adjusts the active weight vector to make the output maximally super-Gaussian.

The current paper proposes to use the weights of a superdirective beamformer as the quiescent vector, which results in improved directivity and noise suppression at lower frequencies. We demonstrate the effectiveness of our approach through far-field speech recognition experiments on the Multi-Channel Wall Street Journal Audio Visual Corpus (MC-WSJ-AV). The technique proposed in the current paper reduces the word error rate (WER) by 56% relative to a single distant microphone baseline, which is a 14% reduction in WER over the original MN beamformer formulation.

1. INTRODUCTION

Microphone array processing techniques for hands-free speech recognition have the potential to relieve users from the necessity of donning close talking microphones (CTMs) before dictating or otherwise interacting with automatic speech recognition (ASR) systems [1, 2]. Adaptive beamforming is a promising technique for far-field speech recognition. A conventional beamformer in generalized sidelobe canceller (GSC) configuration is structured such that the direct signal from a desired direction is undistorted [2, §6.7.3].
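The distortionless property of the GSC structure can be checked numerically. The following is a minimal sketch, not the implementation used in this paper. It assumes a far-field source, a uniform linear array, a delay-and-sum quiescent vector, and the common convention w = w_q - B w_a, where w_q is the quiescent vector, B is a blocking matrix whose columns are orthogonal to the steering vector d, and w_a is the active weight vector. The function names and the array geometry are hypothetical.

```python
import numpy as np


def steering_vector(num_mics, spacing_m, angle_rad, freq_hz, speed_of_sound=343.0):
    """Far-field steering vector of a uniform linear array (assumed geometry)."""
    positions = np.arange(num_mics) * spacing_m
    delays = positions * np.cos(angle_rad) / speed_of_sound
    return np.exp(-2j * np.pi * freq_hz * delays)


def gsc_components(d):
    """Return a delay-and-sum quiescent vector and a blocking matrix for d."""
    num_mics = d.shape[0]
    # Delay-and-sum quiescent vector: w_q^H d = 1 because |d_i| = 1.
    w_q = d / num_mics
    # The right singular vectors of the 1 x N matrix d^H beyond the first
    # span its null space, so the resulting columns satisfy B^H d = 0.
    _, _, vh = np.linalg.svd(d.conj()[np.newaxis, :])
    blocking = vh[1:].conj().T  # shape (num_mics, num_mics - 1)
    return w_q, blocking


def gsc_weights(w_q, blocking, w_a):
    """Combine the three GSC blocks into one weight vector (assumed sign convention)."""
    return w_q - blocking @ w_a


d = steering_vector(num_mics=8, spacing_m=0.05, angle_rad=np.pi / 3, freq_hz=1000.0)
w_q, blocking = gsc_components(d)

# Any active weight vector leaves the look-direction response untouched.
rng = np.random.default_rng(0)
w_a = rng.standard_normal(7) + 1j * rng.standard_normal(7)
w = gsc_weights(w_q, blocking, w_a)
look_direction_gain = np.vdot(w, d)  # w^H d
```

Because B^H d = 0, the response w^H d equals w_q^H d = 1 for every choice of w_a, so an optimization criterion may adjust w_a freely without violating the distortionless constraint.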
Typical GSC beamformers consist of three blocks: a quiescent vector, a blocking matrix and an active weight vector. The quiescent vector is calculated to provide unity gain for the direction of interest. The blocking matrix is usually constructed so as to maintain the distortionless constraint for the signal filtered with the quiescent vector. Subject to this constraint, the total output power of the beamformer is minimized through the adjustment of the active weight vector, which effectively places a null on any source of interference, but can also lead to undesirable signal cancellation [3]. Many algorithms have been developed to avoid the latter. These approaches can be classified as follows:

1. updating the active weight vector only when noise signals are dominant [4],
2. constraining the update formula for the active weight vector [5],
3. blocking the leakage of desired signal components into the sidelobe canceller by designing the blocking matrix [5, 6], and
4. using acoustic transfer functions from a desired source to the microphones instead of just compensating time delays [4, 6].

All of these algorithms minimize essentially the same criterion based on second-order statistics (SOS): the total output power subject to the distortionless constraint.

The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement number 213850 and the Cluster of Excellence on Multimodal Computing and Interaction.

We know from the field of independent component analysis (ICA) that nearly all information-bearing signals, like subband samples of speech, are non-Gaussian [7]. On the other hand, noisy or reverberant speech consists of a sum of several signals, and as such tends to have a distribution that is closer to Gaussian. This follows from the central limit theorem, and can be empirically verified [8].
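The Gaussianizing effect of summation can be illustrated with a small numerical sketch. It is not taken from the experiments in [8]. It assumes real-valued, independent Laplace-distributed samples as a stand-in for super-Gaussian subband samples, and uses excess kurtosis (zero for a Gaussian) as a simple indicator of non-Gaussianity. The number of sources and the sample count are arbitrary illustrative choices.

```python
import numpy as np


def excess_kurtosis(x):
    """Sample excess kurtosis: zero for Gaussian data, positive for super-Gaussian data."""
    centered = x - x.mean()
    return np.mean(centered**4) / np.mean(centered**2) ** 2 - 3.0


rng = np.random.default_rng(0)
num_sources, num_samples = 8, 200_000

# Independent super-Gaussian sources (Laplace pdf, an assumed stand-in for speech).
sources = rng.laplace(size=(num_sources, num_samples))

single = sources[0]            # one clean source
mixture = sources.sum(axis=0)  # sum of several signals, as in noisy or reverberant speech

k_single = excess_kurtosis(single)
k_mixture = excess_kurtosis(mixture)
print(f"excess kurtosis, single source: {k_single:.3f}")
print(f"excess kurtosis, sum of {num_sources} sources: {k_mixture:.3f}")
```

For independent, identically distributed sources, the excess kurtosis of a sum of N signals is that of a single signal divided by N. A Laplace source has a theoretical excess kurtosis of 3, so the sum of eight has 3/8, which is much closer to the Gaussian value of zero.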
Hence, by making the distribution of the beamformer’s outputs as non-Gaussian as possible, we can remove the effects of noise and reverberation.

In [8], we proposed a novel beamforming algorithm which adjusted the active weight vectors so as to make the beamformer’s output maximally non-Gaussian. As a measure for the degree of non-Gaussianity we use negentropy, which is the difference between the entropy of the output signal calculated under a Gaussian assumption and that calculated under a non-Gaussian assumption. In other words, negentropy is a measure for the amount by which the distribution of the beamformer’s output deviates from a Gaussian with the same mean and variance. We also showed in [8] that such a beamformer can reduce noise and reverberation without suffering from the signal cancellation problem.

The MN beamformer proposed in [8] used the weights of a delay-and-sum beamformer, which compensates for the time delays of arrival of a desired speech signal at the microphone array, as the quiescent vector. However, due to the limited aperture of the microphone array, such a delay-and-sum beamforming method cannot suppress interference signals at low frequencies. Since the output of the quiescent vector influences the negentropy of the beamformer’s output, the presence of noise in that output degrades the ability of the beamformer to suppress noise or reverberation by estimating the active weight vector based on the maximum negentropy criterion. A superdirective beamformer alleviates this problem by having better directivity at lower frequencies.

The balance of this paper is organized as follows. Section 2 reviews the super-Gaussian distribution and shows that the actual distribution of speech is not Gaussian but super-Gaussian, which is the main motivation for using the maximum negentropy criterion. In Section 3 and Section 4, we review the definitions of the entropy and negentropy, respectively.
In Section 5, we describe the superdirective beamformer. Section 6 describes the new maximum negentropy beamformer in the GSC configuration. In Section 7, we describe the results of far-field automatic speech recognition experiments. Finally, in Section 8, we present our conclusions and plans for future work.

2. MODELING SUBBAND SAMPLES OF SPEECH WITH SUPER-GAUSSIAN PROBABILITY DENSITY FUNCTIONS

In this section we provide empirical evidence that the probability density function (pdf) of speech is super-Gaussian. We use a generalized Gaussian pdf to model the distribution of the subband speech samples.

2.1 Generalized Gaussian pdf

The generalized Gaussian (GG) pdf is well-known and finds frequent application in the blind source separation (BSS) and ICA fields. Moreover, it subsumes the Gaussian and Laplace pdfs as special cases. The GG pdf with zero mean for a real-valued r.v. y

18th European Signal Processing Conference (EUSIPCO-2010) Aalborg, Denmark, August 23-27, 2010 © EURASIP, 2010 ISSN 2076-1465 2067