6th ISCA Workshop on Speech Synthesis, Bonn, Germany, August 22-24, 2007

Adaptive Database Reduction for Domain Specific Speech Synthesis

Aleksandra Krul 1,2, Géraldine Damnati 1, François Yvon 2, Cédric Boidin 1, Thierry Moudenc 1

1 France Télécom R&D Division, TECH/SSTP, 2, avenue Pierre Marzin, 22307 Lannion Cedex, France
{aleksandra.krul,geraldine.damnati,cedric.boidin,thierry.moudenc}@orange-ftgroup.com
2 GET/ENST and CNRS/LTCI, 46, rue Barrault, 75624 Paris Cedex 13, France
yvon@enst.fr

Abstract

This paper raises the issue of speech database reduction adapted to a specific domain for Text-To-Speech (TTS) synthesis application. We evaluate several methods: a database pruning technique based on the statistical behaviour of the unit selection algorithm and a novel method based on the Kullback-Leibler divergence. The aim of the former method is to eliminate the least selected units during the synthesis of a domain specific training corpus. The aim of the latter approach is to build a reduced database whose unit distribution approximates a given target distribution. We compare the reduced databases. Finally, we evaluate these methods on several objective measures given by the unit selection algorithm.

1. Introduction

Current Text-To-Speech systems are based on concatenative methods [1]. Such systems use a large database of pre-recorded speech from which acoustic units are selected for concatenation. The scalability of the database is an important issue in unit selection based speech synthesis. Indeed, the use of the full database is not always suitable or even possible for some applications. The database has to be reduced so that the speech synthesis system can be integrated into different devices.

Two approaches are commonly used for database reduction. In a "bottom-up" approach the database is examined in order to remove spurious and redundant units.
For instance, in [2, 3] units are clustered according to similarity measures on their prosodic and phonetic contexts. Only the units that are representative of each cluster are kept in the reduced database. More recently, an LSM (Latent Semantic Mapping) method was proposed in [4].

The "top-down" approach is based on the investigation of the output of the synthesizer. One implementation consists of synthesizing a large amount of data and removing the units that are not frequently used by the synthesizer. This approach is based on the statistical behaviour of the unit selection algorithm and was originally proposed in [5]. The advantage of such a method is that no knowledge about speech units is needed. It depends closely on the behaviour of the unit selection algorithm.

However, reduced synthesis systems are often used for specific applications such as menu readers in mobile phones. The reduced database therefore has to be adapted to the domain specific application.

In this paper we are interested in this particular paradigm. Our goal is to prune the generic database and to adapt it in order to synthesize a domain specific application corpus on devices that do not support a large amount of data. As the acoustic realization of a specific domain is not known, methods such as those in [2, 3, 4] cannot be used for a reduction adapted to a specific application. We therefore investigate two approaches: a variant of a reduction method based on the statistical behaviour of the unit selection and a novel reduction method guided by the Kullback-Leibler measure.

The first reduction method that we use is a "top-down" approach. Instead of synthesizing a generic corpus, we propose to use a domain specific corpus that reflects the application for which the reduction has to be performed.
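As an illustration only, the top-down idea can be sketched as follows: count how often each unit is selected while synthesizing a domain specific corpus, then keep the most frequently selected units. The function name `prune_by_selection_count`, the `keep_ratio` parameter, the global ranking and the rule of keeping at least one variant per diphone are all assumptions made for this sketch; they are not taken from the method described in this paper.

```python
from collections import Counter


def prune_by_selection_count(database, selections, keep_ratio):
    """Hypothetical sketch of selection-count pruning.

    database:   dict mapping a diphone label to its list of unit ids
    selections: iterable of lists, one per synthesized sentence, holding
                the unit ids chosen by the unit selection algorithm
    keep_ratio: fraction of all units to keep (assumed criterion)
    """
    # Number of times each unit was selected over the whole corpus.
    counts = Counter(unit for sentence in selections for unit in sentence)

    all_units = [(diphone, unit)
                 for diphone, units in database.items()
                 for unit in units]
    budget = max(1, round(keep_ratio * len(all_units)))

    # Most selected units first; the stable sort keeps database order on ties.
    ranked = sorted(all_units, key=lambda pair: -counts[pair[1]])

    kept = {diphone: [] for diphone in database}
    for diphone, unit in ranked[:budget]:
        kept[diphone].append(unit)

    # Assumption: every diphone keeps at least one unit so that any text
    # can still be synthesized. This may slightly exceed the budget.
    for diphone, units in database.items():
        if not kept[diphone] and units:
            kept[diphone].append(max(units, key=lambda u: counts[u]))
    return kept


# Toy data: two diphones, three "synthesized" domain specific sentences.
database = {"a-b": ["ab1", "ab2", "ab3"], "b-a": ["ba1", "ba2"]}
selections = [["ab1", "ba2"], ["ab1", "ba2"], ["ab3", "ba2"]]
reduced = prune_by_selection_count(database, selections, keep_ratio=0.4)
```

In this sketch the choice of corpus enters only through `selections`: replacing the selection logs of a generic corpus with those of a domain specific corpus changes which units survive, without any change to the pruning rule itself.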
We will show that even if the specific corpus is not very large, we obtain better objective results than if we collect statistics by synthesizing a much bigger generic corpus.

The second approach that we investigate is based on the Kullback-Leibler divergence and was introduced in [6]. This method was used to design a textual corpus for speech synthesis. Its main idea is that the distribution of units in the constructed corpus should be close to an a priori distribution. In [6] the flexibility of this method is put forward: the algorithm is able to accommodate different distributions, which may prove better for domain specific TTS synthesis applications. We use this method to construct a reduced database whose unit distribution is close to the domain specific distribution. The distribution of units in the reduced database can be adapted to any domain. The advantage of this method is that it is independent of the speech synthesis system.

In Section 2, we present several approaches for adaptive database reduction. In Section 3, we objectively evaluate all of the methods and present experimental results.

2. Presentation of methods

2.1. Database pruning based on the statistical behaviour of the unit selection algorithm

The main idea of this pruning method is to keep the units that are most often used to synthesize a representative corpus, while the least selected units are pruned. Our system uses the diphone as its elementary unit. Each diphone (about 1200 in French) is present several times (from 1 to thousands) in the acoustic database: each acoustic realization is called a diphone variant or a unit. When synthesizing a message, each variant may or may not be selected. The number of times a variant is selected is called its number of occurrences. The first step consists in synthesizing