MODELLING CONFUSION MATRICES TO IMPROVE SPEECH RECOGNITION ACCURACY, WITH AN APPLICATION TO DYSARTHRIC SPEECH

Omar Caballero Morales and Stephen Cox
School of Computing Sciences, University of East Anglia, Norwich NR4 7TJ, U.K.
S.Caballero-morales@uea.ac.uk, sjc@cmp.uea.ac.uk

ABSTRACT

Dysarthria is a motor speech disorder characterized by weakness, paralysis, or poor coordination of the muscles responsible for speech. Although automatic speech recognition (ASR) systems have been developed for disordered speech, factors such as low intelligibility and limited vocabulary decrease speech recognition accuracy. In this paper, we introduce a technique that can increase recognition accuracy for speakers with low intelligibility by incorporating information from an estimate of the speaker's phoneme confusion matrix. The technique performs much better than standard speaker adaptation when the number of sentences available from a speaker for confusion matrix estimation or adaptation is low, and has similar performance for larger numbers of sentences.

1. INTRODUCTION

"Dysarthria is a motor speech disorder that is often associated with irregular phonation and amplitude, incoordination of articulators, and restricted movement of articulators" [4]. This condition can be caused by a stroke, cerebral palsy, traumatic brain injury, or a degenerative neurological disease such as Parkinson's or Alzheimer's disease. The muscles affected by this condition may include the lungs, larynx, oropharynx and nasopharynx, soft palate and articulators (lips, tongue, teeth and jaw), and the degree to which these muscle groups are compromised determines the particular pattern of speech impairment [4]. This makes the design of an ASR system for dysarthric speakers difficult because, as Rosen and Yampolsky [5] point out, such speakers require different types of ASR depending on their particular type and level of disability.
Rosen and Yampolsky also identify factors that give rise to ASR errors [5], the most important being decreased intelligibility (because of substitutions, deletions and insertions of phonemes) and limited phonemic repertoire, the latter leading to phoneme substitutions. In this paper, we describe a technique for incorporating a model of a speaker's confusion matrix into the ASR process in such a way as to increase recognition accuracy. Although this technique has general application to ASR, we believe that it is particularly suitable for ASR of dysarthric speakers whose low intelligibility is due, in some degree, to a limited phonemic repertoire, and the results presented here confirm this. To illustrate the effect of reduced phonemic repertoire, Figure 1 shows an example phoneme confusion matrix for a dysarthric speaker from the NEMOURS database [1] (see section 3). This confusion matrix is estimated by an ASR system, so it may show confusions that would not actually be made by humans, and also spurious confusions caused by poor transcription/output alignment (see section 2.2). However, since we are concerned with machine rather than human recognition here, we can make the following observations:

1. A small set of phonemes (in this case the phonemes "ax", "ih", "b", "d", "dh", "n" and "z") dominates the speaker's output speech.

2. Some vowel sounds and the consonants "sh" and "th" are never recognised.

[Figure 1: A phoneme confusion matrix (stimulus vs. response) for a dysarthric speaker.]

This suggests that there are some phonemes that the speaker apparently cannot enunciate at all, and for which he or she substitutes a different phoneme, often one of the dominant phonemes mentioned above. Most speaker adaptation algorithms are based on the principle that it is possible to apply a set of transformations to the parameters of a set of acoustic models of an "average" voice to move them closer to the voice of an individual.
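A confusion matrix of this kind is obtained by aligning the reference phoneme transcription ("stimulus") against the recogniser's output ("response") and counting which phones map to which. The following is a minimal sketch of that estimation step, not the authors' code: it uses a standard Levenshtein (edit-distance) alignment, and all data values are invented for illustration.

```python
# Sketch: estimate a phoneme confusion matrix from (reference, recognised)
# phone-sequence pairs via edit-distance alignment. Illustrative only.
from collections import defaultdict

DEL, INS = "<del>", "<ins>"  # markers for deleted / inserted phones

def align(ref, hyp):
    """Levenshtein alignment returning (stimulus, response) pairs,
    with DEL/INS markers where phones were deleted or inserted."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # trace back to recover the aligned pairs
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            pairs.append((ref[i - 1], hyp[j - 1])); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((ref[i - 1], DEL)); i -= 1   # deletion
        else:
            pairs.append((INS, hyp[j - 1])); j -= 1   # insertion
    return pairs[::-1]

def confusion_matrix(utterances):
    """Accumulate counts C[stimulus][response] over (ref, hyp) pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for ref, hyp in utterances:
        for stim, resp in align(ref, hyp):
            counts[stim][resp] += 1
    return counts

# Invented toy data: a speaker who realises "sh" as "s"
data = [(["sh", "ih", "p"], ["s", "ih", "p"]),
        (["sh", "uw"],      ["s", "uw"])]
C = confusion_matrix(data)
```

Row-normalising the counts in `C` then gives the conditional probabilities P(response | stimulus) of the kind visualised in Figure 1; the DEL/INS markers make the matrix "extended" in the sense that deletion and insertion behaviour is captured alongside substitutions.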
Whilst this has been shown to be successful for normal speakers, it may be less successful in cases where the phoneme uttered is not the one that was intended but is substituted by a different phoneme. In this situation, we argue that a more effective approach is to combine a model of the substitutions likely to have been made by the speaker with a language model to infer what was said. We imagine that the speaker wished to utter a word sequence W_in, which can be transcribed using a dictionary into the phoneme sequence S_in.[1] The sequence of phones decoded by the speech recogniser is S_out, and we construct a model that makes use of an extended confusion matrix estimated for the speaker, plus a standard language model, to estimate W_in from S_out. More details of this are given in the next section.

[1] For present purposes, we sidestep the issue of multiple pronunciations, and hence multiple phoneme transcriptions of a word, something that occurs relatively infrequently.
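The inference step described above is a noisy-channel decoder: choose the word sequence W_in that maximises P(S_out | S_in(W)) x P(W), where the first factor comes from the speaker's confusion matrix and the second from the language model. The following sketch illustrates the idea for single words under simplifying assumptions (equal-length alignment, per-phone independence); the dictionary, probabilities, and function names are all invented for illustration and are not the paper's model.

```python
# Sketch of noisy-channel word inference from decoded phones:
# argmax over words of log P(S_out | S_in(W)) + log P(W).
# All data below is an invented toy example.
import math

# Toy dictionary: word -> intended phone sequence S_in
DICT = {"ship": ["sh", "ih", "p"], "sip": ["s", "ih", "p"]}

# Toy unigram language model P(W)
LM = {"ship": 0.7, "sip": 0.3}

# Row-normalised confusion probabilities P(response | stimulus):
# this speaker always realises "sh" as "s"
CONF = {"sh": {"s": 1.0},
        "s":  {"s": 0.9, "sh": 0.1},
        "ih": {"ih": 1.0},
        "p":  {"p": 1.0}}

def decode_word(s_out):
    """Return argmax_W [log P(s_out | S_in(W)) + log P(W)], assuming
    equal-length alignment and per-phone independence."""
    best, best_score = None, -math.inf
    for word, s_in in DICT.items():
        if len(s_in) != len(s_out):
            continue  # toy restriction: ignore insertions/deletions
        score = math.log(LM[word])
        for stim, resp in zip(s_in, s_out):
            p = CONF.get(stim, {}).get(resp, 0.0)
            if p == 0.0:
                score = -math.inf
                break
            score += math.log(p)
        if score > best_score:
            best, best_score = word, score
    return best
```

With this toy data, the decoded phones ["s", "ih", "p"] are mapped back to "ship" rather than "sip": although the acoustics match "sip" exactly, the confusion model knows this speaker substitutes "s" for "sh", and the language model prefers "ship". This is precisely the situation where acoustic adaptation alone cannot help, since the uttered phone really was "s".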