A framework for estimation of clean speech by fusion of outputs from multiple speech enhancement systems

Venkatesh Krishnan, Phil S. Whitehead, David V. Anderson, and Mark A. Clements
Georgia Institute of Technology, Atlanta, GA 30332 USA

Abstract

A novel multiple-input Kalman filtering (MIKF) framework is presented that estimates the clean speech signal by fusion of outputs from multiple speech enhancement systems. The MIKF framework generates a sample-by-sample minimum mean-square error estimate of the clean speech signal from these outputs. The residual noise in each input to the MIKF is modeled as an autoregressive (AR) process so that non-white noise can be accommodated, and the noise model is dynamically updated to handle non-stationary noise. Speech is also modeled as an AR process whose parameters are estimated from a codebook of suitably designed prototype AR parameters. Constraining the AR parameters via a codebook improves the quality and makes it easy to integrate the MIKF system with a speech coder. The proposed framework also has the flexibility to apply user-defined, heuristic weights to the inputs to the MIKF, which are the outputs of the contributing speech enhancement systems. Perceptual quality tests and objective measures (segmental signal-to-noise ratio) both demonstrate that the estimate of the clean speech signal generated by the MIKF is superior to any of its inputs.

1. Introduction

Speech enhancement has been a topic of extensive research for the past five decades. Speech enhancement systems process speech signals degraded by noise to improve their perceptual quality and/or improve the performance of a speech coding or a recognition system [1].
Typically, speech enhancement systems assume that the noise corrupting the speech signal is additive and uncorrelated with it, i.e., if s[t] is the clean speech signal and z[t] is the noisy observation at a sample time instance t, then z[t] = s[t] + n[t] and E{s[t]n[t]} = 0, where n[t] is the noise. Speech enhancement systems seek to estimate the clean speech signal s[t] from z[t] by minimizing the expected value of a suitably chosen distortion function. The outputs of speech enhancement systems often have residual noise and other artifacts, which are difficult to characterize analytically. However, on a sample-by-sample basis, the estimate y[t] of the signal s[t] generated by a speech enhancement system can be assumed to have a residual noise signal v[t] and can be expressed as y[t] = s[t] + v[t].

This work is sponsored by the Defense Advanced Research Projects Agency under contract N00024-02-C-6339. Opinions, interpretations, and recommendations are those of the authors and are not necessarily endorsed by the U.S. Government.

Based on the distortion function chosen and the strategy adopted to minimize it, different speech enhancement systems yield different estimates of the clean speech signal s[t]. Therefore, it would be desirable to develop a “data fusion” framework for optimally combining the outputs of different speech enhancement systems to obtain an improved estimate of the clean speech signal. The ability of a Kalman filter to obtain a minimum mean-square error (MMSE) estimate of a signal on a sample-by-sample basis, using one or more noisy observations, makes it ideally suited for such a framework. Ever since Kalman filters were first reported in the 1960s, they have been widely used in signal estimation and tracking applications, as well as in speech processing [2] [3] [4].
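To make the idea of sample-by-sample fusion of several noisy observations concrete, the following is a minimal sketch and not the MIKF itself. It assumes a first-order AR signal model with known parameters and white, mutually independent residual noise with known variances in each observation y_k[t] = s[t] + v_k[t]; the framework in this paper instead models the residual noise as AR processes and estimates all parameters. The function names (fuse_kalman_ar1, mse, demo) and all numeric values are hypothetical and chosen only for illustration.

```python
import random


def fuse_kalman_ar1(ys, a, q, r):
    """Fuse K noisy observation streams of an AR(1) signal, sample by sample.

    ys: list of K equal-length lists, with ys[k][t] = s[t] + v_k[t]
    a, q: assumed AR(1) coefficient (|a| < 1) and driving-noise variance
    r: list of K residual-noise variances (white noise assumed in this sketch)
    """
    n_samples = len(ys[0])
    x = 0.0
    p = q / (1.0 - a * a)  # stationary AR(1) variance as the initial uncertainty
    obs_precision = sum(1.0 / rk for rk in r)
    s_hat = []
    for t in range(n_samples):
        # Prediction step from the AR(1) signal model.
        x_pred = a * x
        p_pred = a * a * p + q
        # Update step in information form; this form relies on the K
        # observation noises being independent of each other.
        p = 1.0 / (1.0 / p_pred + obs_precision)
        x = p * (x_pred / p_pred + sum(ys[k][t] / r[k] for k in range(len(ys))))
        s_hat.append(x)
    return s_hat


def mse(u, v):
    """Mean-square error between two equal-length sequences."""
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v)) / len(u)


def demo(seed=0, n_samples=5000, a=0.95, q=1.0, r=(1.0, 4.0)):
    """Simulate the assumed model and compare input errors with the fused error."""
    rng = random.Random(seed)
    s, prev = [], 0.0
    for _ in range(n_samples):
        prev = a * prev + rng.gauss(0.0, q ** 0.5)
        s.append(prev)
    ys = [[st + rng.gauss(0.0, rk ** 0.5) for st in s] for rk in r]
    s_hat = fuse_kalman_ar1(ys, a, q, list(r))
    return [mse(y, s) for y in ys], mse(s_hat, s)
```

Calling demo() returns the mean-square error of each simulated input and of the fused estimate. Under the assumptions above, the fused error should fall below that of the best single input, because each observation contributes in proportion to the inverse of its noise variance and the signal model adds further information.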
In this paper, we present a novel framework employing multiple-input Kalman filters (MIKF) for optimally combining the outputs of multiple speech enhancement systems or other sources. The proposed MIKF framework assumes that the clean speech signal and the residual noise present in the inputs to the MIKF can be modeled as independent Gaussian autoregressive (AR) processes. The AR model parameters for the MIKF framework are estimated using an iterative Expectation-Maximization (EM) algorithm [5], which obtains a maximum-likelihood (ML) estimate of these parameters. The AR model parameters for the speech are constrained to belong to a codebook of suitably designed AR model prototypes, trained on a database of clean speech.

In generating a sample-by-sample MMSE estimate of the clean speech, the MIKF automatically weights each of its inputs in inverse proportion to the amount of residual noise present in that input. However, it may be desirable to impose additional heuristic weights on each of the inputs, which can be determined externally to the MIKF framework based on measures such as perceptual quality or intelligibility. The proposed framework has the flexibility to allow such heuristic weighting in a time-varying manner. A detailed description of how the parameters of the MIKF can be chosen to implement this weighting is provided in Section 3.

Furthermore, since the EM algorithm seeks to estimate optimally the AR parameters for the speech model and constrains them to belong to a codebook of prototype AR parameters, the MIKF framework is well suited for efficient use in conjunction with any model-based speech coder.

Section 4 presents the results of a simulation in which the outputs of two independent speech enhancement systems and the original noisy signal are successfully fused using the MIKF framework to estimate the clean speech signal.
It is demonstrated that the estimate of the clean speech by the proposed system has a better segmental signal-to-noise ratio (SSNR) and perceptual quality than any of the inputs to the MIKF (which are the outputs of the speech enhancement systems).

DOI: 10.21437/Interspeech.2005-740

2. Multiple-input Kalman Filtering Paradigm

In this section, the mathematical formulation of the MIKF framework, shown in Fig. 1, is presented. At the sample time t, the MIKF takes the outputs y_1[t], y_2[t], ..., y_K[t] from K independent speech enhancement systems or from other sources. Also at t, let the residual noise in the outputs