Proceedings of 20th International Congress on Acoustics, ICA 2010
23–27 August 2010, Sydney, Australia

Four-tone modeling for natural singing synthesis in Chinese and comparing synthesized singings with speaking voices

Kenko OTA (1) and Terumasa EHARA (2)
(1) Faculty of Systems Engineering, Tokyo University of Science Suwa, Nagano, Japan
(2) Faculty of Human Cultures, Yamanashi Eiwa College, Yamanashi, Japan

PACS: 43.72.Ja, 43.75.Rs

ABSTRACT

Many researchers currently work on singing synthesis in languages such as Japanese and English. However, there is little research on singing synthesis in Chinese. This paper therefore studies four-tone modeling for natural singing synthesis in Chinese. The four tones are a characteristic of Chinese syllables and are modeled as follows: the 1st tone is a horizontal linear (constant) function, the 2nd tone is a linearly increasing function, the 3rd tone is a quadratic function and the 4th tone is a linearly decreasing function. Four types of four-tone model are defined in order to identify the optimal one. The proposed four-tone models are controlled by a parameter that determines the changing rate of the fundamental frequency. Subjective evaluations clarified the following about fundamental frequency control for natural singing synthesis: for the 1st tone, the fundamental frequency need not be changed from that of the score; for the 2nd tone, the fundamental frequency should change in the last half of the note duration; for both the 3rd and 4th tones, the fundamental frequency should change in the first half of the note duration; and the optimal changing rates for the 2nd, 3rd and 4th tones are 1.5%, 1.0% and 1.5%, respectively. In this paper, the changing rate of the fundamental frequency of singing voices synthesized by the above-mentioned system is compared with that of speaking voices and real singing voices. First, the changing rate of speaking voices in Chinese is calculated.
The changing rate for each tone varies widely among individuals. However, the trend of the changing rate across the 2nd to 4th tones is similar for every speaker. Second, the changing rate of real singing voices in Chinese is calculated. The changing rate of a singing voice appears to be similar to the optimal parameter values for singing synthesis, except for the 3rd tone. Moreover, the changing rate of a singing voice depends on the singer's skill level: the changing rate of good singers tends to be smaller than that of poor singers. Third, the similarity between synthesized singing and real singing by Chinese singers is investigated by comparing their fundamental frequency contours. The synthesized singing voice appears to be close to the real singing voice of good singers.

INTRODUCTION

Many researchers currently work on singing synthesis techniques, ranging from fundamental research on singing voices to applied research for software products[(1)]. Singing synthesis techniques can be classified into two types. One is corpus-based techniques[(2)][(3)], and the other is techniques that synthesize a singing voice from a speaking voice[(4)][(5)].

Although corpus-based techniques are highly practical, they have the following drawbacks. An enormous amount of singing voice data must be recorded in order to build a corpus. Moreover, the individuality of the synthesized singing voice is lost. These techniques have been applied to singing synthesis of Japanese or English songs. However, there are few techniques for singing synthesis of Chinese songs.

On the other hand, techniques that synthesize a singing voice from a speaking voice can keep the individuality of the speaker. Saito et al. have proposed one such technique. Kubo et al. have proposed a technique for synthesizing Chinese singing voices. However, the study by Kubo et al.
has not considered the four tones, which are one of the characteristics of Chinese syllables. Hence, the synthesized singing voices could not be heard as natural Chinese singing.

The authors have proposed a singing synthesis technique for Chinese[(6)][(7)]. In this paper, these results are briefly introduced. Moreover, synthesized singing voices are compared with both speaking and singing voices.

The rest of this paper consists of the following four sections. The section “Related researches” introduces related work and clarifies the position of our research. The section “Singing synthesis system in Chinese” gives an overview of our singing synthesis system and describes the four-tone models. The section “Comparison of synthesized singings with speaking and singing voices” compares synthesized singing with speaking and singing voices. Finally, the section “Conclusion” concludes this paper.

RELATED RESEARCHES

Vocaloid

Vocaloid is one of the corpus-based singing synthesizers. Vocaloid can synthesize arbitrary singing voices from input notes and lyrics. The corpus, named the “singer library”, contains samples extracted from a large amount of singing recorded by voice actors. Vocaloid employs a technique that smoothly concatenates samples, so it can realize natural singing synthesis. Currently, although Vocaloid can handle Japanese and English songs, it cannot handle