ABSTRACT

In noisy listening conditions, the information available on which to base speech recognition decisions is necessarily incomplete: some spectro-temporal regions are dominated by other sources. We report on the application of a variety of techniques for missing data in speech recognition. These techniques may be based on marginal distributions or on reconstruction of missing parts of the spectrum. Application of these ideas in the Resource Management task shows performance which is robust to random removal of up to 80% of the frequency channels, but falls off rapidly with deletions which more realistically simulate masked speech. We report on a vowel classification experiment designed to isolate some of the RM problems for more detailed exploration. The results of this experiment confirm the general superiority of marginals-based schemes, demonstrate the viability of shared covariance statistics, and suggest several ways in which performance improvements on the larger task may be obtained.

1. BACKGROUND

The missing data problem arises naturally in many pattern recognition tasks [2,8] where elements of data vectors to be classified are unavailable during training and/or recognition. The causes of incomplete evidence include unreliable sensors, band-restricted data transmission (e.g. the spectral filtering action of a telephone channel), or partial occlusion of the desired pattern by an interfering signal. In the latter case, it is assumed that some preprocessor is able to determine which parts of the mixed observation correspond to the source to be classified.

Our motivation for studying the missing data problem derives from ongoing studies at Sheffield and elsewhere [1] on computational auditory scene analysis (CASA), in which evidence for different sound sources is separated using auditory grouping principles. CASA is an attractive paradigm for robust ASR.
It makes no assumptions about the type and number of acoustic sources which make up the mixture, and does not require prior exposure to these sources. However, separation will never be able to recover all the evidence: there will be some regions where other sound sources dominate. CASA-based robust ASR requires that the resulting missing data problem be confronted.

In previous work [4,9] we demonstrated that it is possible to remove high proportions (up to 90%) of the input spectrum without significant deterioration in recognition rates. In ICASSP-95, we reported (using NOISEX) noise tolerance comparable to that of human listeners when only those spectro-temporal regions with a favourable local SNR were retained. Subsequently, we have applied missing data techniques to the Resource Management (RM) task [5]. The main results of that study are outlined in section 3 of this paper. The RM experiments highlight a number of outstanding problems with the practical application of missing data ideas. Here, we address these issues with a more focussed problem, that of TIMIT vowel identification using a Gaussian classifier (section 4). This task allows for a comparison of missing data techniques which would have been computationally infeasible on RM, and decouples the observation probability estimation problem from the problem of finding the best model sequence.

2. MISSING DATA TECHNIQUES FOR MULTIVARIATE GAUSSIAN DISTRIBUTIONS

Missing components of pattern vectors can either be estimated or ignored. Estimates assume an importance in situations where reconstruction of the data vector is required, possibly for further processing (e.g. further pattern transformation prior to classification), or for regeneration (e.g. resynthesis). Ignoring missing data means attempting to classify the observation solely on the basis of the information present. It has been argued [2] that it can be inappropriate to replace missing values with any estimate.
Both kinds of approach benefit from some model for the process giving rise to the observations. Here, we assume that the observation vector $x$ belongs to one of a number of classes, each of which is modelled as a mixture of $K$ multivariate Gaussian distributions:

$$ f(x \mid S_j) = \sum_{i=1}^{K} c_{ij}\, \phi(x, \mu_{ij}, C_{ij}) \tag{1} $$

$S_j$ represents model $j$ (or, e.g., an emitting state of an HMM for class $j$), $c_{ij}$ is the weight of mixture $i$ for model $j$, and $\phi$ is the $n$-dimensional Gaussian distribution (mean $\mu$, covariance $C$):

$$ \phi(x, \mu, C) = \frac{1}{(2\pi)^{0.5n}\, |C|^{0.5}} \exp\left( -0.5\, (x - \mu)^t C^{-1} (x - \mu) \right) \tag{2} $$

The missing data problem for pattern classification is the computation of $f(x \mid S_j)$ for an incomplete vector $x$. It will be convenient to re-order $x$ as $x = (x_p, x_m)$, where $x_p$ and $x_m$ represent, respectively, the subvectors of present and missing components. To simplify things further, we will drop model subscript $j$ and present the required formulae for the single mixture condition. All arguments presented here are applicable to the multiple-mixture case. The mean and covariance matrix are similarly partitioned:

$$ \mu = (\mu_p, \mu_m), \qquad C = \begin{pmatrix} C_{pp} & C_{pm} \\ C_{mp} & C_{mm} \end{pmatrix} \tag{3} $$

One simple estimation technique is to replace missing values by unconditional model means (so-called mean imputation), i.e.

$$ x_m = \mu_m \tag{4} $$

This approach makes no use of present components and hence cannot exploit information in the covariance. An alternative is to calculate model means conditioned on those components present. For multivariate Gaussians, this conditional distribution is also Gaussian [12], with mean and covariance:

$$ x_{m|p} = \mu_m + C_{pm}^t C_{pp}^{-1} (x_p - \mu_p) \tag{5} $$

$$ C_{m|p} = C_{mm} - C_{pm}^t C_{pp}^{-1} C_{pm} \tag{6} $$

For unconditional and conditional mean replacement techniques, classification proceeds by computing

MISSING DATA TECHNIQUES FOR ROBUST SPEECH RECOGNITION
Martin Cooke, Andrew Morris & Phil Green
{m.cooke,a.morris,p.green}@dcs.shef.ac.uk
Computer Science, University of Sheffield, Regent Court, 211, Portobello Street, Sheffield, UK
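The two estimation techniques of section 2, unconditional mean imputation and conditional mean estimation for a single Gaussian, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `mean_imputation` and `conditional_mean_imputation`, the boolean mask `present`, and the use of NaN to mark missing entries are all assumptions made for illustration.

```python
import numpy as np


def mean_imputation(x, present, mu):
    """Replace missing components by the unconditional model mean.

    `present` is a hypothetical boolean mask: True where x is observed.
    """
    present = np.asarray(present, dtype=bool)
    x_hat = np.array(x, dtype=float)
    x_hat[~present] = np.asarray(mu, dtype=float)[~present]
    return x_hat


def conditional_mean_imputation(x, present, mu, C):
    """Replace missing components by the mean conditioned on the present ones.

    Returns the completed vector and the conditional covariance of the
    missing block, for a single Gaussian with mean mu and covariance C.
    """
    present = np.asarray(present, dtype=bool)
    missing = ~present
    mu = np.asarray(mu, dtype=float)
    C = np.asarray(C, dtype=float)
    x_hat = np.array(x, dtype=float)

    # Partition the covariance into present/missing blocks.
    C_pp = C[np.ix_(present, present)]
    C_pm = C[np.ix_(present, missing)]
    C_mm = C[np.ix_(missing, missing)]

    # Solve linear systems rather than forming the inverse of C_pp explicitly.
    w = np.linalg.solve(C_pp, x_hat[present] - mu[present])
    x_hat[missing] = mu[missing] + C_pm.T @ w
    C_cond = C_mm - C_pm.T @ np.linalg.solve(C_pp, C_pm)
    return x_hat, C_cond


if __name__ == "__main__":
    # Toy two-channel example: channel 0 observed, channel 1 missing.
    mu = np.array([0.0, 0.0])
    C = np.array([[1.0, 0.8], [0.8, 1.0]])
    x = np.array([2.0, np.nan])
    present = np.array([True, False])

    print(mean_imputation(x, present, mu))
    print(conditional_mean_imputation(x, present, mu, C))
```

In this toy case the unconditional estimate of the missing channel is the model mean (0), whereas the conditional estimate is pulled towards the observed channel (about 1.6, with conditional variance about 0.36), which shows how the conditional scheme exploits the covariance that mean imputation ignores.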