The USTC and iFlytek Speech Synthesis Systems for Blizzard Challenge 2007

Zhen-Hua Ling¹, Long Qin¹, Heng Lu¹, Yu Gao¹, Li-Rong Dai¹, Ren-Hua Wang¹, Yuan Jiang², Zhi-Wei Zhao², Jin-Hui Yang², Jie Chen², Guo-Ping Hu²

¹ University of Science and Technology of China
² iFlytek Research, Hefei, Anhui, China

zhling@ustc.edu

Abstract

This paper introduces the speech synthesis systems developed by USTC and iFlytek for Blizzard Challenge 2007. Both systems are HMM-based and employ similar training algorithms, in which context-dependent HMMs for spectrum, F0 and duration are estimated from the acoustic features and contextual information of the training database. However, the two systems adopt different synthesis methods. In the USTC system, speech parameters are generated directly from the statistical models and a parametric synthesizer is used to reconstruct the speech waveform. The iFlytek system is a waveform concatenation system, which uses the maximum likelihood criterion of the statistical models to guide the selection of phone-sized candidate units. Comparing the evaluation results of the two systems in Blizzard Challenge 2007, we find that the parametric synthesis system achieves better intelligibility than the unit selection system. On the other hand, the speech synthesized by the unit selection system is more similar to the original speech and more natural, especially when the full training set is used.

1. Introduction

In recent years, the HMM-based parametric speech synthesis method has been proposed and has made significant progress [1-3]. In this method, spectrum, pitch and duration are modeled simultaneously in a unified HMM framework [1], and the speech parameters are generated from the HMMs under the maximum likelihood criterion by using dynamic features [4]. A parametric synthesizer is then used to reconstruct the speech signal. This method is able to synthesize highly intelligible and smooth speech.
In addition, the voice characteristics of the synthetic speech can be controlled flexibly by employing model adaptation methods [5]. However, the speech quality of this method suffers from the unnatural output of the parametric synthesizer, even when a high-quality vocoder such as STRAIGHT [6] is used. To overcome this problem, an HMM-based unit selection and waveform concatenation speech synthesis method has also been proposed [7,8]. In this method, likelihood and Kullback-Leibler divergence criteria derived from the trained HMMs are used to select the optimal sequence of frame-sized or phone-sized units. The waveforms of the selected units are then concatenated to produce the synthesized speech. The advantage of this method over conventional unit selection is that statistical criteria are introduced into the calculation of the target cost and the concatenation cost, so the synthesis system can be trained automatically with little expert knowledge or manual tuning.

Two systems, one adopting the HMM-based parametric synthesis method and the other the HMM-based unit selection method, were developed by USTC and iFlytek respectively for Blizzard Challenge 2007. The flowcharts of the two systems are shown in Figure 1. They share almost the same training algorithms but differ in the synthesis stage.

This paper is organized as follows. Section 2 introduces the details of the HMM-based parametric synthesis system developed by USTC. Section 3 describes the unit selection method used in the iFlytek system. Section 4 describes how the systems were built. Section 5 presents the evaluation results and discussion. Section 6 concludes the paper.
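
The HMM-guided unit selection idea summarized in the introduction can be illustrated with a minimal Python sketch. It is not the iFlytek implementation. It assumes one diagonal-covariance Gaussian per target unit, a single feature vector per candidate unit, a zero-mean Gaussian on the feature difference at each join, and a KLD measured from the target model to the model each candidate was aligned with. All names (`kld_diag_gauss`, `preselect`, `select_units`, `join_var`, and the candidate dictionary keys) are hypothetical.

```python
import math


def kld_diag_gauss(mean_p, var_p, mean_q, var_q):
    """Closed-form KL(p || q) between two diagonal-covariance Gaussians."""
    total = 0.0
    for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q):
        total += math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
    return 0.5 * total


def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    total = 0.0
    for xi, m, v in zip(x, mean, var):
        total += -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
    return total


def preselect(target_model, candidates, k):
    """Keep the k candidates whose own model is closest (in KLD) to the target."""
    t_mean, t_var = target_model
    ranked = sorted(
        candidates,
        key=lambda c: kld_diag_gauss(t_mean, t_var, c["mean"], c["var"]),
    )
    return ranked[:k]


def select_units(target_models, candidate_lists, join_var, k=2):
    """Viterbi search for the unit sequence with the highest total log-likelihood.

    The score of a path is the sum of target log-likelihoods (candidate feature
    under the target Gaussian) and join log-likelihoods (feature difference
    between neighbours under a zero-mean Gaussian with variance join_var).
    """
    shortlists = [
        preselect(model, cands, k)
        for model, cands in zip(target_models, candidate_lists)
    ]
    # best[i][j] = (best path score ending in unit j at position i, backpointer)
    best = [[(log_gauss(u["feat"], *target_models[0]), None) for u in shortlists[0]]]
    for i in range(1, len(shortlists)):
        mean, var = target_models[i]
        row = []
        for unit in shortlists[i]:
            target_ll = log_gauss(unit["feat"], mean, var)
            scores = []
            for j, prev in enumerate(shortlists[i - 1]):
                delta = [a - b for a, b in zip(unit["feat"], prev["feat"])]
                join_ll = log_gauss(delta, [0.0] * len(delta), join_var)
                scores.append((best[i - 1][j][0] + join_ll, j))
            prev_score, back = max(scores)
            row.append((prev_score + target_ll, back))
        best.append(row)
    # Backtrack from the best final unit to recover the selected sequence.
    j = max(range(len(best[-1])), key=lambda n: best[-1][n][0])
    path = []
    for i in range(len(best) - 1, -1, -1):
        path.append(shortlists[i][j]["id"])
        j = best[i][j][1]
    path.reverse()
    return path
```

Under these assumptions, `select_units(target_models, candidate_lists, join_var, k)` returns the identifiers of the chosen units in order. A smaller `k` reduces the search cost at the risk of pruning a candidate that would have joined well with its neighbours.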
[Figure 1 flowchart. Training (shared): speech database, question set, text analysis, STRAIGHT analysis, labels, speech, HMM training, spectrum & F0 HMMs. HMM-based parametric synthesis (USTC): text, text analysis, contextual HMM sequence decision, ML-based parameter generation, LSP-based formant enhancement, STRAIGHT filter, synthesized speech. HMM-based unit selection & waveform concatenation synthesis (iFlytek): text, text analysis, contextual HMM sequence decision, KLD-based unit pre-selection, ML-based unit selection, speech database, waveform concatenation, synthesized speech.]

Figure 1: Flowcharts of USTC and iFlytek systems for Blizzard Challenge 2007.

The Blizzard Challenge 2007 - Bonn, Germany, August 25, 2007