Predicting how it sounds: Re-ranking dialogue prompts based on TTS quality for adaptive Spoken Dialogue Systems

Cedric Boidin 1, Verena Rieser 2, Lonneke van der Plas 3, Oliver Lemon 2, Jonathan Chevelu 1
1 Orange Labs, Lannion, France
2 School of Informatics, University of Edinburgh, UK
3 Department of Linguistics, University of Geneva, CH
cedric.boidin@orange-ftgroup.com, vrieser@inf.ed.ac.uk

Abstract

This paper presents a method for adaptively re-ranking paraphrases in a Spoken Dialogue System (SDS) according to their predicted Text To Speech (TTS) quality. We collect data under 4 different conditions and extract a rich feature set of 55 TTS runtime features. We build predictive models of user ratings using linear regression with latent variables. We then show that these models transfer to a more specific target domain on a separate test set. All our models significantly outperform a random baseline. Our best performing model reaches the same performance as reported by previous work, but it requires 75% less annotated training data. The TTS re-ranking model is part of an end-to-end statistical architecture for Spoken Dialogue Systems developed by the EC FP7 CLASSIC project.

Index Terms: speech synthesis, spoken dialogue systems, TTS quality prediction, re-ranking.

1. Introduction

Ultimately, one would like to know "how good it will sound" before generating a prompt in a Spoken Dialogue System (SDS). Evidence from a corpus collected by [1] shows the importance of considering Text To Speech (TTS) quality for SDS: 5.2% of the user utterances indicate a problem with the TTS quality. For example, the user asks the system to repeat because s/he was not able to understand what the system said.

In this paper we present a re-ranker model to select paraphrases that are predicted to sound most natural when synthesised with unit selection TTS, following previous work by [2], [3].
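The re-ranking idea can be sketched as follows: given several paraphrases of the same dialogue act, score each with a model of predicted TTS user rating and output the highest-scoring one. The scoring function below is a hypothetical placeholder (it simply penalises prompt length); the real model, built from 55 TTS runtime features, is described in the following sections.

```python
def predict_tts_rating(prompt: str) -> float:
    """Hypothetical stand-in for the learned rating model: penalise
    length as a placeholder for real TTS runtime features."""
    return 5.0 - 0.05 * len(prompt)

def rerank(paraphrases):
    """Return the candidate paraphrases sorted best-first by
    predicted TTS quality."""
    return sorted(paraphrases, key=predict_tts_rating, reverse=True)

# Illustrative candidates for one dialogue act (invented examples):
candidates = [
    "I cancel your order.",
    "Your order has been cancelled as requested.",
    "OK, cancelling.",
]
best = rerank(candidates)[0]  # prompt predicted to sound best
```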
However, our approach requires 75% less annotated data, while reaching the same prediction performance. We first gather training data on user ratings of synthesised prompts (Section 2). We then use linear regression with latent variables to predict the perceived user rating (Section 3). Finally, we evaluate the model on a separate test corpus of paraphrases of possible system prompts (Section 4).

This re-ranker model is used in the general architecture of the CLASSIC project (see Figure 1).1 In this project we propose an end-to-end statistical treatment of uncertainty and context-adaptive strategies for Automatic Speech Recognition (ASR), Spoken Language Understanding (SLU), Dialogue Management (DM), Natural Language Generation (NLG), and TTS. In this framework, the re-ranker chooses between alternative inputs to the TTS module according to the text-to-speech module's capabilities. These alternative inputs are constructed by the statistical NLG module of the CLASSIC architecture [4]. The target domain is Interactive Voice Response (IVR) applications, especially troubleshooting and customer service domains (see for example [5]). Ultimately, this work should lead to more robust SDS with higher user ratings, because the system will be able to select its best possible output based on an accurate predictive model of users' TTS ratings.

1 European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 216594, www.classic-project.org.

Figure 1: End-to-end statistical CLASSIC architecture for SDS

2. Training data collection

2.1. Synthesised utterances

We first collect training data. 144 different utterances were synthesised with a female French voice of the state-of-the-art unit selection Orange Labs speech synthesiser.2 The utterances were taken from two application domains: IVR applications (e.g. "J'annule votre demande." - I cancel your order.) and movie subtitles (e.g. "Tu me donnes quel âge ?" - How old do you think I am?).
We choose the training data to represent a wider spectrum than just the CLASSIC target domain, in order to build a more general model, which has the potential to transfer to different settings. Furthermore, each utterance was synthesised with two different versions of the TTS voice. One version uses the full inventory (approximately 3 hours of speech), and the other uses a reduced inventory (30% of the full inventory, approximately 1 hour of speech, selected randomly). Thus, the training set contains utterances from 4 different conditions, contributing equal portions to the training set (n=288):

- IVR/full: the IVR utterances synthesised with the full acoustic inventory,
- IVR/red: the IVR utterances synthesised with the reduced inventory,
- ST/full: the subtitle utterances synthesised with the full inventory, and
- ST/red: the subtitle utterances synthesised with the reduced inventory.

2 See demonstrator at http://tts.elibel.tm.fr

Copyright 2009 ISCA. Interspeech 2009, 6-10 September, Brighton, UK. 10.21437/Interspeech.2009-662
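The 2x2 design above (domain x inventory) can be sketched as follows; the utterance identifiers are placeholders, and the equal 72-per-condition split follows from 144 utterances, half per domain, each synthesised under both inventory conditions.

```python
N_PER_DOMAIN = 72                # 144 utterances, assumed split evenly
domains = ["IVR", "ST"]          # IVR prompts vs. movie subtitles
inventories = ["full", "red"]    # full vs. reduced unit inventory

# Each utterance of each domain appears once per inventory condition,
# giving the four conditions IVR/full, IVR/red, ST/full, ST/red.
training_set = [
    {"domain": d, "inventory": inv, "utt_id": i}
    for d in domains
    for inv in inventories
    for i in range(N_PER_DOMAIN)
]
```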