Analysis of major factors of naturalness degradation in concatenative synthesis

Toshio Hirai†, Hisashi Kawai†,‡, Minoru Tsuzaki†,*, Nobuyuki Nishizawa†
†ATR Spoken Language Translation Research Labs., 2–2–2 Hikaridai, Seika-cho, Souraku-gun, Kyoto 619–0288, Japan
‡KDDI R&D Labs., Japan
*Kyoto City University of Arts, Japan
toshio.hirai@atr.jp

Abstract

To effectively improve a speech synthesis system, it is important to find and focus on improving the modules whose effect on the naturalness degradation of synthesized speech is the largest. In this paper, we describe the design of a perception experiment that measures the effect of each module separately. The experiment uses synthesized speech stimuli whose intermediate information is modified during the synthesis process. A perception experiment evaluating a Japanese concatenative speech synthesis system revealed that the text processing module and a part of the feature prediction module (the part predicting the fundamental frequency) were the major factors degrading naturalness.

1. Introduction

Concatenative Text-to-Speech (TTS) systems are composed of four major modules, namely a text processing module, an acoustic- and prosodic-feature prediction module, a segment selection module, and a waveform concatenation module. Although all of the modules are expected to be improved, the resources to achieve this are not always sufficient. In such a case, it is necessary to find and focus on improving the modules whose effect on the naturalness degradation of synthesized speech is the largest. To find such modules, the amount of degradation must be measured for each module separately. However, past research on Japanese speech synthesis systems has focused only on gross system performance (e.g., intelligibility/understandability tests at the syllable/word/sentence level, and prosodic naturalness evaluation tests) [1, 2, 3].
In this paper, we describe the design of an experiment that evaluates each module's performance separately. The modules evaluated are limited to the text processing module and the feature prediction module (more precisely, its subordinate parts, i.e., the acoustic- and prosodic-feature prediction parts), whose performance is closely related to the naturalness of synthesized speech. To evaluate the text processing module, the naturalness degradation of a speech stimulus synthesized from the module's uncorrected intermediate output is measured with reference to a stimulus synthesized from phone- and prosody-corrected intermediate output. To evaluate the feature prediction module, the naturalness degradation of a stimulus synthesized from a feature set in which one predicted feature replaces its counterpart among the features extracted from natural speech is measured with reference to a stimulus synthesized entirely from the features extracted from natural speech. We applied the method to a concatenative speech synthesis system called XIMERA [4], which has been developed at ATR. The target languages of XIMERA are Japanese and Chinese, although in this paper the method was applied only to Japanese. An experiment with 40 listeners was conducted, and the results and a discussion of the experiments are described in detail.

2. XIMERA speech synthesis system

The data flow in XIMERA is as described below: (1) The text processing module processes the input Japanese text (a mixture of Chinese characters and Japanese phonetic characters) to annotate linguistic (morphological and phonetic) and prosodic (accentual) information. (2) The resulting information is used to predict the time series of acoustic/prosodic features with an acoustic- and prosodic-feature prediction module.
In the case of XIMERA, HTS (the HMM-based Speech Synthesis System, "H Triple S") [5, 6] is used as this module. For training the feature prediction model in HTS, sentences from an ATR phonetically balanced task (ATR503) [7] (503 sentences), a travel guidebook task (665 sentences), and a newspaper story task (498 sentences) were used (1,666 sentences in total). The set of predicted features is called the "target." The acoustic feature is the mel-cepstrum (mcep), and the prosodic features are segment duration and fundamental frequency (F0). (3) In the segment selection module, a "target cost" and a "join cost" are calculated from the target features and the attributes of the speech segments in a speech database. The costs are integrated into one cost for each candidate segment sequence, and the sequence with the lowest cost is selected. (4) Finally, in the waveform generation module, the selected segments are smoothly joined into a waveform, which is output as synthesized speech. In the following section, a method to estimate the naturalness degradation of synthesized speech caused by each module, and its application to the modules in XIMERA, are detailed.

3. Analysis of major factors of naturalness degradation

The target used to generate a "natural enough" synthesized speech (hereafter called "speech (a)" or just "(a)") is the "natural target." There are many candidates for the natural target. A target extracted from natural speech "(n)" is a good choice, since it is easy to acquire and its stability is guaranteed. In this paper, this target is treated as the natural one. The influence of the feature prediction module's imperfection on the naturalness degradation of synthesized speech is estimated separately by measuring the naturalness degradation of speech synthesized with merged features, in which one predicted feature (either mcep, segment duration, or F0) replaces the corresponding feature of the natural target of speech (a).
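As an aside, the cost-integrated search in step (3) of the XIMERA pipeline can be sketched as a dynamic-programming (Viterbi-style) minimization over candidate segment sequences. The sketch below is illustrative only: the scalar feature values and the absolute-difference cost functions are hypothetical placeholders, and XIMERA's actual target and join costs are far more elaborate.

```python
def target_cost(candidate, target):
    # Hypothetical distance between a candidate segment's feature
    # and the predicted target feature.
    return abs(candidate - target)

def join_cost(prev_candidate, candidate):
    # Hypothetical discontinuity penalty at the concatenation point.
    return abs(candidate - prev_candidate)

def select_segments(targets, candidates_per_slot):
    """Dynamic-programming search for the candidate sequence whose
    integrated (target + join) cost is lowest."""
    n = len(targets)
    # best[i][j]: minimal cost of reaching candidate j at slot i
    best = [[target_cost(c, targets[0]) for c in candidates_per_slot[0]]]
    back = [[None] * len(candidates_per_slot[0])]
    for i in range(1, n):
        row, ptr = [], []
        for c in candidates_per_slot[i]:
            costs = [best[i - 1][k] + join_cost(p, c)
                     for k, p in enumerate(candidates_per_slot[i - 1])]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_min] + target_cost(c, targets[i]))
            ptr.append(k_min)
        best.append(row)
        back.append(ptr)
    # Backtrack from the lowest-cost final candidate.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [candidates_per_slot[i][path[i]] for i in range(n)]
```

The join term is what makes an exhaustive per-slot greedy choice insufficient: a candidate that matches its target slightly worse may still win if it concatenates more smoothly with its neighbors.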
(Each synthesized speech is