Eurospeech 2001 - Scandinavia

Reducing spectral mismatches in concatenative speech synthesis via systematic database enrichment

Maria Founda, George Tambouratzis, Aimilios Chalamandaris, George Carayannis

Institute for Language and Speech Processing
6, Artemidos str. & Epidavrou, Paradissos Amaroussiou 151 25, Athens, Greece
Email: {mfounda, achalam, giorg_t, gcara}@ilsp.gr

Abstract

This paper presents work performed for the Time-Domain TTS system, which is being developed at the ILSP for the Greek language. It focuses on enhancing the quality of the synthetic speech by reducing the spectral mismatches between concatenated segments. To that end, a study has been performed to determine the distance that best predicts when a spectral mismatch is audible. Different spectral distances were tested, and the best-performing distance was used to systematically enrich the segment database, which initially contained only one instance per segment. Results of this procedure indicate a substantial improvement in synthetic speech quality.

1. Introduction

This work focuses on research associated with the new ILSP Text-To-Speech (TTS) system, which is based on the concatenative synthesis paradigm, speech being synthesized by concatenating diphones. In such systems the generated synthetic speech, although highly intelligible, tends to sound unnatural. This is mainly attributable to the mismatches that occur at the joints of the diphones and to the distortion introduced by the prosodic modifications of the speech signals.

To deal with this problem, a sophisticated algorithm that chooses the best matching unit is required [1], [2]. This solution requires that the database contain several instances of each diphone taken from different contexts, and therefore with different prosodic and spectral characteristics.
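The idea of choosing, among several stored instances of each diphone, the sequence whose joints match best can be illustrated with a small dynamic-programming search. This is only a minimal sketch and not the selection algorithm of the ILSP system: the Euclidean distance between boundary spectral vectors is an assumed stand-in for a spectral distance, the dictionary keys `start` and `end` are hypothetical names for the spectral vectors at the two edges of an instance, and the prosodic cost is left out.

```python
import math


def join_cost(left_unit, right_unit):
    """Assumed join cost: Euclidean distance between the spectral vector at
    the end of the left instance and at the start of the right instance."""
    return math.dist(left_unit["end"], right_unit["start"])


def select_units(candidates):
    """Pick one instance per target diphone so that the summed join cost is minimal.

    candidates: list with one entry per target diphone; each entry is a
    non-empty list of instances (dicts with hypothetical keys "start"/"end").
    Returns (total_cost, list of chosen instance indices).
    """
    # best[i][j] = (cheapest cost of a path ending in instance j of diphone i,
    #               index of the predecessor instance on that path)
    best = [[(0.0, None) for _ in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for unit in candidates[i]:
            costs = [best[i - 1][k][0] + join_cost(prev, unit)
                     for k, prev in enumerate(candidates[i - 1])]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append((costs[k_min], k_min))
        best.append(row)

    # Trace back from the cheapest final instance.
    j = min(range(len(best[-1])), key=lambda idx: best[-1][idx][0])
    total = best[-1][j][0]
    path = [j]
    for i in range(len(candidates) - 1, 0, -1):
        j = best[i][j][1]
        path.append(j)
    path.reverse()
    return total, path
```

Because the search keeps the cheapest path into every instance, it can prefer an instance with a slightly worse left joint when that instance gives a much better right joint, which a purely greedy left-to-right choice would miss.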
During synthesis, the most appropriate instance should be chosen, so that both the mismatches at the joints and the prosodic modifications required are minimized.

The goal of this paper is to study a specific source of distortion, namely the spectral mismatch between the concatenated diphones. In our case, instead of performing spectral manipulation to smooth the joints at unit boundaries, spectral discontinuities are reduced by providing multiple diphone instances in the database. A unit selection algorithm is then employed to choose the best matching diphone sequence. To achieve this, four spectral distances were compared in an attempt to find an optimal measure of audible spectral discontinuities. This step was followed by a systematic enrichment of the initial database (which contained only one instance of each diphone), in an attempt to minimise the mismatches at joints between vowels. The enrichment was initially restricted to vowels because it has been reported that spectral mismatch at diphone joints has its greatest effect within vowels [2], [3]. Finally, the quality of the synthetic speech generated from the enriched database was evaluated via both objective (analytical) and subjective methods.

2. Experimenting with four spectral distances

In this section, the effect of spectral mismatches on synthetic speech is examined via specifically designed experiments. The aim was to determine the distance that most accurately indicates when a spectral mismatch is audible. To achieve that, listening tests were performed using recorded speech samples.

2.1 Description of listening tests

To implement the listening tests, speech material was first obtained from a trained male speaker. The test utterances used in the experiment were presented as sets of phrases called test sets.
In each test set the same phrase was repeated, each time with a single diphone replaced by another marked instance of the same phonetic identity originating from a different context. All prosodic characteristics (duration, power and pitch contour) were smoothed to the values of the original diphone, thus leaving only one major source of distortion, the spectral mismatch. A combination of labeled samples and natural utterances provided reference points throughout the experiment to ensure the accuracy of the experimental results. A total of 31 subjects served as listeners. High-quality equipment was employed for all subjects, in order to make even the slightest distortion audible (a detailed description of this procedure is given in [4]).

2.2 Evaluation of the listeners

The set of listeners can be divided into two groups: Group1, whose members had a background in speech processing, and Group2, whose members had no such background. Group1 consisted of 11 listeners and Group2 of 20 listeners. The marks given to the speech samples by the listeners were evaluated in terms of accuracy and consistency. This was performed in two ways. First, the natural utterance was included in each test set to be evaluated by the listeners. Listeners who repeatedly marked these utterances with low scores should be excluded from the subset used to evaluate the spectral distances. In addition, a certain test set was presented both at the beginning and at the end of each listening test. This was used to reveal whether each listener was consistent in the