Adaptive Speech Recognition and Dialogue Management for Users with Speech Disorders

I. Casanueva, H. Christensen, T. Hain, P. Green
Department of Computer Science, University of Sheffield, United Kingdom
i.casanueva@sheffield.ac.uk, {h.christensen,t.hain,p.green}@dcs.shef.ac.uk

Abstract

Spoken control interfaces are very attractive to people with severe physical disabilities, who often also have a type of speech disorder known as dysarthria. This condition is known to decrease the accuracy of automatic speech recognisers (ASRs), especially for users with moderate to severe dysarthria. In this paper we investigate how applying probabilistic dialogue management (DM) techniques can improve the interaction performance of an environmental control system for such users. The effect of having access to different amounts of adaptation data, as well as of using different vocabulary sizes for speakers of different intelligibilities, is investigated. We explore the effect of adapting the DM models as the ASR performance increases, as is the case in systems where more adaptation data is collected through system use. Improvements over a non-probabilistic DM baseline are seen both in terms of dialogue length and success rate: 9% and 25% mean relative improvement respectively. Looking at just the more severely dysarthric speakers, these numbers rise to 25% and 75% mean relative improvement. These improvements are larger when the amount of ASR adaptation data is small. Further results show that a DM trained on data from multiple speakers outperforms a DM trained on data from a single speaker.

Index Terms: dysarthric speech, dialogue management, environmental control system

1. Introduction

Automatic speech recognisers (ASRs) perform poorly for dysarthric users, yet these users often have physical disabilities that would make speech-enabled environmental control (EC) very attractive. A previous study [1] has shown that maximum a posteriori (MAP) adaptation can significantly increase the accuracy of the ASR for mildly dysarthric speakers, but performance remains very low for speakers with lower intelligibility. During the last decade, probabilistic dialogue management (DM) for spoken dialogue systems has shown promising performance in terms of robustness when used in conjunction with systems with high word error rates. These techniques show a higher relative improvement than other DM techniques as the ASR performance decreases [2]. Another advantage of probabilistic DM is that its dialogue policy can be optimised with respect to a specific reward function using reinforcement learning [3], meaning that the dialogue manager will automatically adapt its behaviour to the characteristics of the ASR. Probabilistic DM has previously been used with dysarthric speakers with promising results [4].

In [5] we presented an EC system (homeService) for dysarthric speakers, where the ASR acoustic models are updated as more adaptation data is collected. One of the key points of this system is its adaptation to a specific user, meaning that the performance of the system improves as the user interacts more with it. Probabilistic DM fits naturally into this configuration, since it can adapt its dialogue policy both to user-specific characteristics and to the ASR accuracy changing over time.
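As a rough illustration of the mechanism described above (and not of the actual homeService models), the following sketch shows how a probabilistic DM can maintain a belief over the user's intended command under a noisy ASR, and how a simple threshold policy becomes more "conservative" (asking for confirmation) when ASR accuracy is low. The command set, the uniform confusion model and the thresholds are hypothetical assumptions; in a POMDP-style DM the policy would instead be learned by optimising a reward function with reinforcement learning [3].

# Minimal, hypothetical sketch of probabilistic dialogue state tracking for an
# EC command task. This is NOT the homeService implementation; the command
# names, the ASR accuracy parameter and the thresholds are illustrative.

COMMANDS = ["tv_on", "tv_off", "light_on", "light_off", "radio_on", "radio_off"]

def uniform_belief():
    return {c: 1.0 / len(COMMANDS) for c in COMMANDS}

def update_belief(belief, asr_hypothesis, asr_accuracy):
    """Bayesian update: the observed word is correct with probability
    asr_accuracy, otherwise it is assumed to be a uniform confusion."""
    posterior = {}
    for cmd, prior in belief.items():
        if cmd == asr_hypothesis:
            likelihood = asr_accuracy
        else:
            likelihood = (1.0 - asr_accuracy) / (len(COMMANDS) - 1)
        posterior[cmd] = likelihood * prior
    norm = sum(posterior.values())
    return {c: p / norm for c, p in posterior.items()}

def choose_action(belief, execute_threshold=0.85, confirm_threshold=0.5):
    """Threshold policy: with a poor ASR the belief stays flat, so the system
    confirms or re-asks more often (the 'conservative' behaviour); with an
    accurate ASR it executes directly."""
    top_cmd, top_prob = max(belief.items(), key=lambda kv: kv[1])
    if top_prob >= execute_threshold:
        return ("execute", top_cmd)
    if top_prob >= confirm_threshold:
        return ("confirm", top_cmd)
    return ("ask_repeat", None)

# Example turn: an ASR with 60% accuracy recognises "tv_on".
belief = update_belief(uniform_belief(), "tv_on", asr_accuracy=0.6)
print(choose_action(belief))  # -> ('confirm', 'tv_on'): the policy asks for confirmation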
A slightly different DM configuration has been implemented, in which the dialogue models tracking the current state of the dialogue (the user intention) change over time as the ASR accuracy changes. The main improvement we expect from this framework is a system with a more "conservative" dialogue policy, meaning that it asks more confirmation questions when the ASR performance is poorer, and is able to automatically adapt this policy to a more straightforward one as the ASR performance improves.

2. Dysarthric data

UASpeech [6] is by far the largest database of dysarthric speech suitable for training acoustic models for ASR. It includes speech from 15 speakers with a range of impairment levels, with a total of about 18 hours of speech. Each recording is a single word, and the database covers 10 numbers, the 26 NATO alphabet letters, 19 command words, 100 common words and 300 uncommon words. All speakers have dysarthric speech of different severity. The database comes with information about the speakers' characteristics as well as an intelligibility measure, which is clustered into 4 groups: very low (2% to 15%, 4 speakers); low (28% to 43%, 3 speakers); mid (58% to 62%, 3 speakers); and high (86% to 95%, 5 speakers).

3. Effect of amount of adaptation data and vocabulary size on ASR accuracy

Maximum a posteriori (MAP) adaptation [7] has been shown to be a successful way of establishing acoustic models when faced with limited amounts of data from a given speaker. In [1], accuracy results on the UASpeech task are presented using about 40 minutes of data for each speaker and employing the whole 455-word vocabulary. Here we investigate the effect on accuracy of using less data for adaptation, as would be the case when initially setting up e.g. a new EC system. The effect of reducing the vocabulary, and hence the decoder confusability, is also investigated.

The ASR accuracy results are shown in Fig. 1; each line shows the mean and standard deviation for each intelligibility group as a function of the amount of data used for MAP adaptation. Especially in the 36-word case, the accuracy improvement converges after a certain amount of data. Reducing the vocabulary size has the effect of increasing the accuracy and decreasing the amount of data needed to reach this convergence point, which is a key point for a system like [5]. Collecting large amounts of enrolment data from a dysarthric user is infeasible, but 36 com-
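For completeness, the sketch below illustrates the standard MAP mean update behind the adaptation experiments in this section, reduced to a single Gaussian; the function name, the value of tau and the toy data are illustrative assumptions, not the actual recipe used for the UASpeech models. It shows why a small amount of adaptation data already moves the model towards the speaker, and why the gain saturates once the adaptation statistics dominate the prior.

import numpy as np

def map_adapt_mean(prior_mean, frames, occupancies, tau=20.0):
    """MAP update of a single Gaussian mean (simplified illustration).

    prior_mean:  speaker-independent mean vector (the prior).
    frames:      adaptation feature vectors assigned to this Gaussian, shape (T, D).
    occupancies: per-frame occupation probabilities gamma_t, shape (T,).
    tau:         prior weight; larger tau keeps the adapted mean closer to the prior.
    """
    occ = occupancies.sum()                              # total occupancy for this Gaussian
    weighted_sum = (occupancies[:, None] * frames).sum(axis=0)
    # With little data (small occ) the prior dominates; as occ grows the estimate
    # converges to the speaker-specific sample mean, which is why the accuracy
    # gain from extra adaptation data eventually flattens out.
    return (tau * prior_mean + weighted_sum) / (tau + occ)

# Toy example with 39-dimensional features and 100 adaptation frames.
rng = np.random.default_rng(0)
prior = np.zeros(39)
frames = rng.normal(loc=1.0, size=(100, 39))   # speaker data centred away from the prior
gammas = np.ones(100)
print(map_adapt_mean(prior, frames, gammas)[:3])  # shifted most of the way towards 1.0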