6th ISCA Workshop on Speech Synthesis, Bonn, Germany, August 22-24, 2007

Adaptive Database Reduction for Domain Specific Speech Synthesis

Aleksandra Krul (1,2), Géraldine Damnati (1), François Yvon (2), Cédric Boidin (1), Thierry Moudenc (1)

(1) France Télécom R&D Division, TECH/SSTP, 2, avenue Pierre Marzin, 22307 Lannion Cedex, France
{aleksandra.krul,geraldine.damnati,cedric.boidin,thierry.moudenc}@orange-ftgroup.com
(2) GET/ENST and CNRS/LTCI, 46, rue Barrault, 75624 Paris Cedex 13, France
yvon@enst.fr

Abstract

This paper addresses the problem of reducing a speech database so that it is adapted to a specific domain, for Text-To-Speech (TTS) synthesis applications. We evaluate several methods: a database pruning technique based on the statistical behaviour of the unit selection algorithm, and a novel method based on the Kullback-Leibler divergence. The aim of the former method is to eliminate the units that are least often selected during the synthesis of a domain specific training corpus. The aim of the latter approach is to build a reduced database whose unit distribution approximates a given target distribution. We compare the reduced databases. Finally, we evaluate these methods on several objective measures given by the unit selection algorithm.

1. Introduction

Current Text-To-Speech systems are based on concatenative methods [1]. Such systems use a large database of pre-recorded speech from which acoustic units are selected for concatenation. The scalability of the database is an important issue in unit selection based speech synthesis. Indeed, the use of the full database is not always suitable, or even possible, for some applications. The database has to be reduced so that the speech synthesis system can be integrated into different devices.

Two approaches are commonly used for database reduction. In a "bottom-up" approach, the database is examined in order to remove spurious and redundant units.
For instance, in [2, 3] units are clustered according to similarity measures over their prosodic and phonetic contexts. Only the units that are representative of each cluster are kept in the reduced database. More recently, an LSM (Latent Semantic Mapping) method was proposed in [4].

The "top-down" approach is based on the investigation of the output of the synthesizer. One implementation consists in synthesizing a large amount of data and removing the units which are not frequently used by the synthesizer. This approach is based on the statistical behaviour of the unit selection algorithm and was originally proposed in [5]. The advantage of such a method is that no knowledge about the speech units is needed. It is closely dependent on the behaviour of the unit selection algorithm.

However, reduced synthesis systems are often used for specific applications, such as menu readers on mobile phones. The reduced database then has to be adapted to the domain specific application.

In this paper we are interested in this particular paradigm. Our goal is to prune the generic database and to adapt it in order to synthesize a domain specific application corpus on devices that do not support a large amount of data. As the acoustic realization of a specific domain is not known, methods such as those in [2, 3, 4] cannot be used for a reduction adapted to a specific application. We therefore investigate two approaches: a variant of a reduction method based on the statistical behaviour of the unit selection, and a novel reduction method guided by the Kullback-Leibler measure.

The first reduction method that we use is a "top-down" approach. Instead of synthesizing a generic corpus, we propose to use a domain specific corpus that reflects the application for which the reduction has to be performed. We will show that even if the specific corpus is not very large, we obtain better objective results than if we collect statistics by synthesizing a much bigger generic corpus.
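
The "top-down" idea described above can be illustrated with a minimal sketch. It assumes that the synthesizer can log, for each synthesized utterance of the (generic or domain specific) corpus, the identifiers of the units it selected. The function names `count_selections` and `prune_by_usage`, the unit identifier format, and the global "keep the N most selected units" rule are hypothetical choices made for illustration; they are not taken from the system described in this paper.

```python
from collections import Counter
from typing import Iterable, Sequence, Set


def count_selections(selected_units_per_utterance: Iterable[Sequence[str]]) -> Counter:
    """Count how many times each unit was chosen by the unit selection algorithm."""
    counts: Counter = Counter()
    for units in selected_units_per_utterance:
        counts.update(units)
    return counts


def prune_by_usage(database_units: Iterable[str], counts: Counter, keep: int) -> Set[str]:
    """Keep the `keep` most frequently selected units (assumed pruning rule).

    Ties are broken by unit identifier so that the result is deterministic.
    Units that were never selected have a count of 0.
    """
    ranked = sorted(database_units, key=lambda unit: (-counts[unit], unit))
    return set(ranked[:keep])


# Toy database: two variants for each of two diphones (hypothetical identifiers).
database = ["a_b#1", "a_b#2", "b_c#1", "b_c#2"]

# Hypothetical selection log obtained by synthesizing three utterances.
selection_log = [
    ["a_b#1", "b_c#1"],
    ["a_b#1", "b_c#2"],
    ["a_b#2", "b_c#1"],
]

counts = count_selections(selection_log)
reduced = prune_by_usage(database, counts, keep=2)
print(sorted(reduced))  # ['a_b#1', 'b_c#1']
```

Note that this global ranking does not guarantee that every diphone keeps at least one variant; a practical pruning procedure would presumably need an additional constraint of that kind.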

The second approach that we investigate is based on the Kullback-Leibler divergence and was introduced in [6]. This method was used there for designing a textual corpus for speech synthesis. Its main idea is that the distribution of units in the constructed corpus should be close to an a priori distribution. In [6] the flexibility of this method is put forward: the algorithm is able to accommodate different distributions, which may prove better for domain specific TTS synthesis applications. We use this method to construct a reduced database whose unit distribution is close to the domain specific distribution. The distribution of units in the reduced database can thus be adapted to any domain. The advantage of this method is that it is independent of the speech synthesis system.

In section 2, we present several approaches for adaptive database reduction. In section 3, we objectively evaluate all of the methods and present experimental results.

2. Presentation of methods

2.1. Database pruning based on the statistical behaviour of the unit selection algorithm

The main idea of this pruning method is to keep the units that are most often used to synthesize a representative corpus, while the least selected units are pruned. Our system uses the diphone as its elementary unit. Each diphone (there are about 1200 in French) is present several times (from one to thousands of times) in the acoustic database: each acoustic realization is called a diphone variant, or a unit. When synthesizing a message, each variant may or may not be selected. The number of times a variant is selected is called its number of occurrences. The first step consists in synthesizing