Corpus-based Generation of F0 Contours of Japanese based on the Generation Process Model and its Control for Prosodic Focus

Keikichi Hirose 1, Keiko Ochi 1, and Nobuaki Minematsu 2

1 Department of Information and Communication Engineering, the University of Tokyo, Tokyo
2 Department of Electrical Engineering and Information Systems, the University of Tokyo, Tokyo
hirose, ochi, minematsu@gavo.t.u-tokyo.ac.jp
Abstract
A fully corpus-based process for generating prosodic features from text is developed. The process first predicts pauses and phone durations, and then generates F0 contours. Since F0 contour generation is based on the generation process model, the generated F0 contours can easily be manipulated at the command level. A method was developed for generating sentence F0 contours when a focus is placed on one of the "bunsetsu" phrases of an utterance. The method predicts the differences in the F0 model commands between utterances with and without focus, and applies them to the F0 model commands predicted beforehand by the baseline method. The validity of the method was shown by experiments on F0 contour generation and speech synthesis.
1. Introduction
The introduction of corpus-based concatenative schemes has largely improved the quality of synthetic speech to a "close to human" level. However, the improvement mostly concerns the segmental features of speech; viewed from the aspect of prosodic features, problems still remain to be solved. Since prosodic features cover a range longer than phonemes, concatenating prosodic features in such short units may produce unnatural speech sounds; prosodic features need to be generated by viewing a whole sentence or longer units.
Recently, the speech synthesis community has paid attention to HMM-based speech synthesis, where flexible control of speech styles is possible by adapting phone HMMs to a new style [1]. In this method, both segmental and prosodic features of speech are processed together in a frame-by-frame manner, which has the advantage that synchronization of the two types of features is kept automatically [2]. Although various styles, such as attitudes and emotions, have been realized with rather high quality by this method, frame-by-frame processing of prosodic features includes some problems. It has the merit that the fundamental frequency (F0) of each frame can be used directly as training data, but, in turn, it sometimes causes sudden F0 undulations (not observable in human speech), especially when the training data are limited. As mentioned already, prosodic features cover a wider time span than segmental features, and should be treated differently.
From these considerations, we have developed a corpus-based method of synthesizing F0 contours in the framework of the generation process model (F0 model) and realized speech synthesis in reading and dialogue styles with various emotions [3, 4]. The model represents a sentence F0 contour as a superposition of accent components on phrase components; the two types of components are assumed to be the responses to step-wise accent commands and impulse-like phrase commands, respectively [5]. By predicting the model commands instead of frame-by-frame F0 values, a good constraint is automatically applied to the generated F0 contours, which keep acceptable speech quality even if the prediction is partly incorrect.
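The superposition described above can be sketched numerically as follows. This is a minimal illustration of the standard F0 model formulation [5], not the authors' implementation; the base frequency Fb, time constants alpha and beta, and the command values are made-up defaults chosen only for the example.

```python
import numpy as np

def phrase_component(t, alpha=3.0):
    """Response to an impulse-like phrase command:
    Gp(t) = alpha^2 * t * exp(-alpha * t) for t >= 0, else 0."""
    tt = np.clip(t, 0.0, None)
    return np.where(t >= 0.0, alpha**2 * tt * np.exp(-alpha * tt), 0.0)

def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent control mechanism:
    Ga(t) = min[1 - (1 + beta*t) * exp(-beta*t), gamma] for t >= 0, else 0."""
    tt = np.clip(t, 0.0, None)
    ga = 1.0 - (1.0 + beta * tt) * np.exp(-beta * tt)
    return np.where(t >= 0.0, np.minimum(ga, gamma), 0.0)

def f0_contour(t, fb, phrase_cmds, accent_cmds, alpha=3.0, beta=20.0):
    """ln F0(t) = ln Fb + sum_i Ap_i * Gp(t - T0_i)
                        + sum_j Aa_j * [Ga(t - T1_j) - Ga(t - T2_j)]"""
    log_f0 = np.full_like(t, np.log(fb))
    for t0, ap in phrase_cmds:        # (onset time, magnitude)
        log_f0 += ap * phrase_component(t - t0, alpha)
    for t1, t2, aa in accent_cmds:    # (onset, offset, amplitude)
        log_f0 += aa * (accent_response(t - t1, beta)
                        - accent_response(t - t2, beta))
    return np.exp(log_f0)

# One phrase command at the utterance onset and one step-wise accent command:
t = np.linspace(0.0, 2.0, 401)
f0 = f0_contour(t, fb=100.0,
                phrase_cmds=[(0.0, 0.5)],
                accent_cmds=[(0.3, 0.8, 0.4)])
```

Because the commands, not the F0 values, are the prediction targets, any modification (e.g. for focus) amounts to shifting or rescaling a handful of command parameters rather than editing a frame-level contour.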
When synthesizing F0 contours, phone and syllable boundary information is necessary. A corpus-based method was therefore also developed for predicting pauses and phone durations from text input. By combining this method with that for F0 contour synthesis, a total scheme was constructed to generate prosodic features for speech synthesis from a text [6].
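As a minimal illustration of why boundary information matters, predicted phone durations can be accumulated into boundary times, at which the F0 model commands are then anchored. The duration values and the chosen anchor indices below are made up for the sketch, not outputs of the predictor described in the text.

```python
from itertools import accumulate
import math

# Hypothetical phone durations (in seconds), standing in for the output of the
# corpus-based duration predictor.
phone_durations = [0.06, 0.09, 0.07, 0.12, 0.08]

# boundaries[k] is the start time of phone k; the last entry is the utterance length.
boundaries = [0.0] + list(accumulate(phone_durations))

# An accent command can then be anchored to phone boundaries, e.g. spanning
# phones 1-3 (indices are illustrative only):
accent_onset, accent_offset = boundaries[1], boundaries[4]
```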
By handling F0 contours in the F0 model framework, a clear relationship is obtained between generated F0 contours and their underlying linguistic (and para-/non-linguistic) information, enabling "flexible" control of prosodic features. It is rather easy to analyze the prosodic controls obtained by statistical methods and to modify generated F0 contours in another corpus-based way, trained using a small speech corpus. As an example of this flexible control, we have developed a method of focus control [7]. Given
ICSP2008 Proceedings
978-1-4244-2179-4/08/$25.00 ©2008 IEEE
647