MAXIMUM NEGENTROPY BEAMFORMING WITH SUPERDIRECTIVITY
Kenichi Kumatani¹, Liang Lu², John McDonough¹, Arnab Ghoshal¹ and Dietrich Klakow¹
¹ Spoken Language Systems at Saarland University in Saarbrücken, Germany
² The Centre for Speech Technology Research at University of Edinburgh in Edinburgh, United Kingdom
web: http://distant-automatic-speech-recognition.org
ABSTRACT
This paper presents new superdirective beamforming algorithms
based on the maximum negentropy (MN) criterion for distant
automatic speech recognition. The MN beamformer is configured in
the generalized sidelobe canceller structure, and uses the weights
derived from a delay-and-sum beamformer as the quiescent weight
vector. While satisfying the distortionless constraint in the look di-
rection, it adjusts the active weight vector to make the output maxi-
mally super-Gaussian.
The current paper proposes to use the weights of a superdirec-
tive beamformer as the quiescent vector, which results in improved
directivity and noise suppression at lower frequencies. We demon-
strate the effectiveness of our approach through far-field speech
recognition experiments on the Multi-Channel Wall Street Journal
Audio Visual Corpus (MC-WSJ-AV). The technique proposed in the
current paper reduces the word error rate (WER) by 56% relative to
a single distant microphone baseline, which is a 14% reduction in
WER over the original MN beamformer formulation.
1. INTRODUCTION
Microphone array processing techniques for hands-free speech
recognition have the potential to relieve users of the necessity
of donning close-talking microphones (CTMs) before dictating or
otherwise interacting with automatic speech recognition (ASR)
systems [1, 2].
Adaptive beamforming is a promising technique for far-field
speech recognition. A conventional beamformer in generalized
sidelobe canceller (GSC) configuration is structured such that the
direct signal from a desired direction is undistorted [2, §6.7.3]. A typical
GSC beamformer consists of three blocks: a quiescent vector, a
blocking matrix and an active weight vector. The quiescent vector is
calculated to provide unity gain for the direction of interest. The
blocking matrix is usually constructed so as to maintain the distortionless
constraint for the signal filtered with the quiescent vector. Subject
to this constraint, the total output power of the beamformer is
minimized through the adjustment of the active weight vector, which
effectively places a null on any source of interference, but can also
lead to undesirable signal cancellation [3]. Many algorithms have been
developed to avoid the latter. These approaches can be classified
as follows:
1. updating the active weight vector only when noise signals are
dominant [4],
2. constraining the update formula for the active weight vector [5],
3. blocking the leakage of desired signal components into the side-
lobe canceller by designing the blocking matrix [5, 6], and
4. using acoustic transfer functions from a desired source to micro-
phones instead of just compensating time delays [4, 6].
These algorithms all attempt to minimize essentially the same criterion
based on second-order statistics (SOS), namely the total output
power, while maintaining the distortionless constraint.
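The GSC structure described above can be sketched numerically. The following is a minimal illustration, not the implementation used in this work: the uniform linear array, its spacing, the frequency, the look angle and the helper names (`steering_vector`, `blocking_matrix`, `gsc_weights`) are all assumptions made for the example. It shows that, once the quiescent vector has unity gain in the look direction and the blocking matrix is orthogonal to the steering vector, the distortionless constraint holds for any choice of active weight vector.

```python
import numpy as np

def steering_vector(n_mics, spacing, freq, angle, c=343.0):
    """Far-field steering vector of a uniform linear array (assumed geometry)."""
    positions = np.arange(n_mics) * spacing
    delays = positions * np.cos(angle) / c
    return np.exp(-2j * np.pi * freq * delays)

def blocking_matrix(v):
    """Orthonormal basis of the subspace orthogonal to v, so that B^H v = 0."""
    # Left singular vectors of the N x 1 matrix v: the first spans v,
    # the remaining N - 1 span its orthogonal complement.
    u, _, _ = np.linalg.svd(v.reshape(-1, 1), full_matrices=True)
    return u[:, 1:]

def gsc_weights(w_q, B, w_a):
    """Total GSC weight vector w = w_q - B w_a."""
    return w_q - B @ w_a

# Hypothetical 4-microphone array, 5 cm spacing, 1 kHz, 60 degree look angle.
v = steering_vector(n_mics=4, spacing=0.05, freq=1000.0, angle=np.pi / 3)
w_q = v / len(v)          # delay-and-sum quiescent vector, w_q^H v = 1
B = blocking_matrix(v)

# Arbitrary active weights: the look-direction gain must not depend on them.
rng = np.random.default_rng(0)
w_a = rng.standard_normal(3) + 1j * rng.standard_normal(3)
w = gsc_weights(w_q, B, w_a)

look_direction_gain = np.vdot(w, v)   # w^H v
print(abs(look_direction_gain))
```

Because the constraint is built into the structure, the active weight vector can be optimized without constraints; only the criterion being optimized (output power in the conventional case) decides whether signal cancellation occurs.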
The research leading to these results has received funding from the Eu-
ropean Community’s Seventh Framework Programme (FP7/2007-2013) un-
der grant agreement number 213850 and the Cluster of Excellence on Mul-
timodal Computing and Interaction.
We know from the field of independent component analysis
(ICA) that nearly all information-bearing signals, such as subband
samples of speech, are non-Gaussian [7]. On the other hand, noisy or
reverberant speech consists of a sum of several signals, and as such
tends to have a distribution that is closer to Gaussian. This follows
from the central limit theorem, and can be empirically verified [8].
Hence, by making the distribution of the beamformer’s outputs as
non-Gaussian as possible, we can remove the effects of noise
and reverberation.
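The central limit theorem argument can be illustrated with synthetic data. The sketch below is an assumption-laden toy example rather than the empirical verification reported in [8]: it uses real-valued, independent, unit-scale Laplacian sources instead of subband speech samples, and it uses excess kurtosis (zero for a Gaussian) as a simple stand-in measure of non-Gaussianity. For independent, identically distributed sources, the excess kurtosis of a sum of K sources is the single-source value divided by K, so the mixture moves toward Gaussianity as more sources are added.

```python
import numpy as np

def excess_kurtosis(x):
    """Sample excess kurtosis; zero for a Gaussian, positive for super-Gaussian data."""
    x = x - x.mean()
    return np.mean(x**4) / np.mean(x**2) ** 2 - 3.0

rng = np.random.default_rng(0)
n_samples = 200_000

# Eight independent super-Gaussian (Laplacian) sources; a hypothetical stand-in
# for a desired signal plus interfering signals and reflections.
sources = rng.laplace(size=(8, n_samples))

kurt_single = excess_kurtosis(sources[0])             # theoretical value: 3
kurt_mixture = excess_kurtosis(sources.sum(axis=0))   # theoretical value: 3 / 8

print(f"single source: {kurt_single:.2f}, sum of 8 sources: {kurt_mixture:.2f}")
```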
In [8], we proposed a novel beamforming algorithm which adjusts
the active weight vectors so as to make the beamformer’s output
maximally non-Gaussian. As a measure of the degree of non-Gaussianity
we use negentropy, which is the difference between the
entropy of the output signal calculated under a Gaussian assumption
and that calculated under a non-Gaussian assumption. In other
words, negentropy is a measure of the amount by which the distribution
of the beamformer’s output deviates from a Gaussian with
the same mean and variance. We also showed in [8] that such a
beamformer can reduce noise and reverberation without suffering
from the signal cancellation problem.
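As a worked illustration of this definition, the sketch below evaluates negentropy in closed form for a real-valued, zero-mean Laplacian random variable. This is a simplifying assumption made for the example: the beamformer operates on complex subband samples, and the function names here are hypothetical. The Gaussian entropy is computed for the same variance as the Laplacian, so the difference measures only the deviation in shape.

```python
import math

def gaussian_entropy(variance):
    """Differential entropy (nats) of a real Gaussian with the given variance."""
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)

def laplace_entropy(scale):
    """Differential entropy (nats) of a real Laplacian with scale parameter b."""
    return 1.0 + math.log(2.0 * scale)

def laplace_negentropy(scale):
    """Entropy of the variance-matched Gaussian minus the Laplacian entropy."""
    variance = 2.0 * scale**2   # variance of a Laplacian with scale b is 2 b^2
    return gaussian_entropy(variance) - laplace_entropy(scale)

negentropy = laplace_negentropy(scale=1.0)
print(f"negentropy of a Laplacian: {negentropy:.4f} nats")
```

The result simplifies to (ln π − 1)/2, roughly 0.072 nats, regardless of the scale parameter. It is positive because the Gaussian has the largest entropy among all densities with a given variance, and it is scale-invariant because both entropies shift by the same amount when the variable is rescaled.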
The MN beamformer proposed in [8] used the weights of a
delay-and-sum beamformer, which compensates for the time delays of
arrival of a desired speech signal at the microphone array, as the
quiescent vector. However, due to the limited aperture of the
microphone array, such a delay-and-sum beamformer cannot
suppress interference signals at low frequencies. Since the output of
the quiescent vector influences the negentropy of the beamformer’s
output, the presence of noise in that output degrades the ability of the
beamformer to suppress noise or reverberation when the active
weight vector is estimated under the maximum negentropy criterion. A
superdirective beamformer alleviates this problem by providing better
directivity at lower frequencies.
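The low-frequency advantage can be checked with a small numerical sketch. The code below uses the common textbook formulation of a superdirective beamformer, namely the minimum variance distortionless response solution for a spherically isotropic (diffuse) noise field with diagonal loading; whether this matches the exact design used later in this paper is an assumption. The linear array geometry, the 500 Hz test frequency, the endfire look direction, the loading level and all function names are likewise hypothetical choices for illustration.

```python
import numpy as np

C = 343.0  # speed of sound in m/s (assumed)

def diffuse_coherence(positions, freq):
    """Coherence matrix of a spherically isotropic noise field."""
    dist = np.abs(positions[:, None] - positions[None, :])
    # np.sinc(x) = sin(pi x) / (pi x), so this equals sin(k d) / (k d), k = 2 pi f / c.
    return np.sinc(2.0 * freq * dist / C)

def steering_vector(positions, freq, angle):
    """Far-field steering vector for a linear array along the x axis."""
    return np.exp(-2j * np.pi * freq * positions * np.cos(angle) / C)

def delay_and_sum(v):
    return v / len(v)

def superdirective(v, gamma, loading=1e-2):
    """w = (Gamma + mu I)^-1 v / (v^H (Gamma + mu I)^-1 v)."""
    x = np.linalg.solve(gamma + loading * np.eye(len(v)), v)
    return x / np.vdot(v, x)

def diffuse_gain_db(w, v, gamma):
    """Array gain against diffuse noise, |w^H v|^2 / (w^H Gamma w), in dB."""
    signal = np.abs(np.vdot(w, v)) ** 2
    noise = np.real(np.vdot(w, gamma @ w))
    return 10.0 * np.log10(signal / noise)

positions = np.arange(4) * 0.05   # hypothetical 4 microphones, 5 cm apart
freq = 500.0                      # a low frequency relative to the aperture
v = steering_vector(positions, freq, angle=0.0)
gamma = diffuse_coherence(positions, freq)

gain_ds = diffuse_gain_db(delay_and_sum(v), v, gamma)
gain_sd = diffuse_gain_db(superdirective(v, gamma), v, gamma)
print(f"delay-and-sum: {gain_ds:.1f} dB, superdirective: {gain_sd:.1f} dB")
```

Both weight vectors have unity gain in the look direction, so either can serve as a quiescent vector in the GSC. The diagonal loading term trades some directivity for robustness to sensor noise; as the loading grows, the solution approaches the delay-and-sum weights.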
The balance of this paper is organized as follows. Section 2 reviews
the super-Gaussian distribution and shows that the actual
speech distribution is not Gaussian but super-Gaussian, which
is the main motivation for using the maximum negentropy criterion.
In Sections 3 and 4, we review the definitions of
entropy and negentropy, respectively. In Section 5, we describe the
superdirective beamformer. Section 6 describes the new maximum
negentropy beamformer in the GSC configuration. In Section 7, we
describe the results of far-field automatic speech recognition exper-
iments. Finally, in Section 8, we present our conclusions and plans
for future work.
2. MODELING SUBBAND SAMPLES OF SPEECH WITH
SUPER-GAUSSIAN PROBABILITY DENSITY FUNCTIONS
In this section we provide empirical evidence that the probability
density function (pdf) of speech is super-Gaussian. We use a gener-
alized Gaussian pdf to model the distribution of the subband speech
samples.
2.1 Generalized Gaussian pdf
The generalized Gaussian (GG) pdf is well-known and finds fre-
quent application in the blind source separation (BSS) and ICA
fields. Moreover, it subsumes the Gaussian and Laplace pdfs as
special cases. The GG pdf with zero mean for a real-valued r.v. y
18th European Signal Processing Conference (EUSIPCO-2010) Aalborg, Denmark, August 23-27, 2010
© EURASIP, 2010 ISSN 2076-1465 2067