Corpus-based Generation of F0 Contours of Japanese based on the Generation Process Model and its Control for Prosodic Focus

Keikichi Hirose 1, Keiko Ochi 1, and Nobuaki Minematsu 2

1 Department of Information and Communication Engineering, the University of Tokyo, Tokyo
2 Department of Electrical Engineering and Information Systems, the University of Tokyo, Tokyo
hirose, ochi, minematsu@gavo.t.u-tokyo.ac.jp
Abstract
A fully corpus-based process for generating prosodic features from text is developed. The process first predicts pauses and phone durations, and then generates F0 contours. Since F0 contour generation is based on the generation process model, the generated F0 contours can easily be manipulated at the command level. A method was developed for generating sentence F0 contours when a focus is placed on one of the "bunsetsu" phrases of an utterance. The method predicts the differences in the F0 model commands between utterances with and without focus, and applies them to the F0 model commands predicted beforehand by the baseline method. The validity of the method was shown by experiments on F0 contour generation and speech synthesis.
1. Introduction
The introduction of corpus-based concatenative schemes has largely improved the quality of synthetic speech to a "close to human" level. However, the improvement mostly concerns the segmental features of speech; viewed from the aspect of prosodic features, problems still remain to be solved. Since prosodic features cover a range longer than phonemes, concatenating prosodic features in such short units may produce unnatural speech sounds; prosodic features need to be generated by viewing a whole sentence or longer units.
Recently, the speech synthesis community has paid attention to HMM-based speech synthesis, where flexible control of speech styles is possible by adapting phone HMMs to a new style [1]. In this method, both segmental and prosodic features of speech are processed together in a frame-by-frame manner, which has the advantage that synchronization of the two types of features is kept automatically [2]. Although various styles, such as attitudes and emotions, have been realized with rather high quality by this method, frame-by-frame processing of prosodic features includes some problems. It has the merit that the fundamental frequency (F0) of each frame can be used directly as training data, but, in turn, it sometimes causes sudden F0 undulations (not observable in human speech), especially when the training data are limited. As mentioned already, prosodic features cover a wider time span than segmental features, and should be treated differently.
From these considerations, we have developed a corpus-based method of synthesizing F0 contours in the framework of the generation process model (F0 model) and realized speech synthesis in reading and dialogue styles with various emotions [3, 4]. The model represents a sentence F0 contour as a superposition of accent components on phrase components; the two types of components are assumed to be the responses to step-wise accent commands and impulse-like phrase commands, respectively [5]. By predicting the model commands instead of frame-by-frame F0 values, a good constraint is automatically applied to the generated F0 contours, which keep acceptable speech quality even if the prediction is partly incorrect.
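The superposition described above can be sketched numerically as follows. This is a minimal illustration of the standard F0 model formulation [5], not the authors' implementation; the base frequency Fb, time constants alpha and beta, and the command values are made-up defaults chosen only for the example.

```python
import numpy as np

def phrase_component(t, alpha=3.0):
    """Response to an impulse-like phrase command:
    Gp(t) = alpha^2 * t * exp(-alpha * t) for t >= 0, else 0."""
    tt = np.clip(t, 0.0, None)
    return np.where(t >= 0.0, alpha**2 * tt * np.exp(-alpha * tt), 0.0)

def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent control mechanism:
    Ga(t) = min[1 - (1 + beta*t) * exp(-beta*t), gamma] for t >= 0, else 0."""
    tt = np.clip(t, 0.0, None)
    ga = 1.0 - (1.0 + beta * tt) * np.exp(-beta * tt)
    return np.where(t >= 0.0, np.minimum(ga, gamma), 0.0)

def f0_contour(t, fb, phrase_cmds, accent_cmds, alpha=3.0, beta=20.0):
    """ln F0(t) = ln Fb + sum_i Ap_i * Gp(t - T0_i)
                        + sum_j Aa_j * [Ga(t - T1_j) - Ga(t - T2_j)]"""
    log_f0 = np.full_like(t, np.log(fb))
    for t0, ap in phrase_cmds:        # (onset time, magnitude)
        log_f0 += ap * phrase_component(t - t0, alpha)
    for t1, t2, aa in accent_cmds:    # (onset, offset, amplitude)
        log_f0 += aa * (accent_response(t - t1, beta)
                        - accent_response(t - t2, beta))
    return np.exp(log_f0)

# One phrase command at the utterance onset and one step-wise accent command:
t = np.linspace(0.0, 2.0, 401)
f0 = f0_contour(t, fb=100.0,
                phrase_cmds=[(0.0, 0.5)],
                accent_cmds=[(0.3, 0.8, 0.4)])
```

Because the commands, not the F0 values, are the prediction targets, any modification (e.g. for focus) amounts to shifting or rescaling a handful of command parameters rather than editing a frame-level contour.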
When synthesizing F0 contours, phone and syllable boundary information is necessary. A corpus-based method was therefore also developed for predicting pauses and phone durations from text input. By combining this method with that for F0 contour synthesis, a total scheme was constructed to generate prosodic features for speech synthesis from a text [6].
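As a minimal illustration of why boundary information matters, predicted phone durations can be accumulated into boundary times, at which the F0 model commands are then anchored. The duration values and the chosen anchor indices below are made up for the sketch, not outputs of the predictor described in the text.

```python
from itertools import accumulate
import math

# Hypothetical phone durations (in seconds), standing in for the output of the
# corpus-based duration predictor.
phone_durations = [0.06, 0.09, 0.07, 0.12, 0.08]

# boundaries[k] is the start time of phone k; the last entry is the utterance length.
boundaries = [0.0] + list(accumulate(phone_durations))

# An accent command can then be anchored to phone boundaries, e.g. spanning
# phones 1-3 (indices are illustrative only):
accent_onset, accent_offset = boundaries[1], boundaries[4]
```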
By handling F0 contours in the F0 model framework, a clear relationship is obtained between generated F0 contours and their underlying linguistic (and para-/non-linguistic) information, enabling "flexible" control of prosodic features. It is rather easy to analyze the prosodic controls obtained by statistical methods and to modify generated F0 contours in another corpus-based way, trained using a small speech corpus. As an example of this flexible control, we have developed a method of focus control [7]. Given
ICSP2008 Proceedings
978-1-4244-2179-4/08/$25.00 ©2008 IEEE
647