TRAINING A SUPRA-SEGMENTAL PARAMETRIC F0 MODEL WITHOUT INTERPOLATING F0

Javier Latorre, Mark J.F. Gales, Kate Knill†*, Masami Akamine

Toshiba Research Europe Ltd., Cambridge Research Laboratory, Cambridge, UK
Toshiba Corporate Research & Development Center, Kawasaki, Japan
javier.latorre@crl.toshiba.co.uk

ABSTRACT

Combining multiple intonation models at different linguistic levels is an effective way to improve the naturalness of the predicted F0. In many of these approaches, the intonation models for supra-segmental levels are based on a parametrisation of the log-F0 contours over the units of that level. However, many of these parametrisations are not stable when applied to discontinuous signals. Therefore, the F0 signal has to be interpolated. These interpolated values introduce a distortion in the coefficients that degrades the quality of the model. This paper proposes two methods that eliminate the need for such interpolation, one based on regularization and the other on factor analysis. Subjective evaluations show that, for a discrete cosine transform (DCT) syllable-level model, both approaches result in a significant improvement w.r.t. a baseline using interpolated F0. The approach based on regularization yields the best results.

Index Terms: speech synthesis, intonation, factor analysis, regularization, F0 interpolation

1. INTRODUCTION

Intonation is the temporal variation of pitch. It is an essential part of speech in all human languages, which use it to encode a variety of information such as the type of sentence (question, statement), word emphasis, discourse structure, etc. Most of the information encoded in intonation is supra-segmental, i.e., its structures lie at a linguistic level higher than the phone. In that sense, intonation should be considered continuous and smooth, at least over the time scales defined by those supra-segmental structures [1].
A problem in creating an intonation model is that pitch is a subjective psychoacoustic property of sound which cannot be obtained directly from the waveform. Instead, the fundamental frequency (F0) is used as its closest measurable proxy. However, F0 does not exist, or is unobservable, for unvoiced phones. Therefore, the observed F0 trajectory is usually discontinuous for any supra-segmental structure.

In standard HMM-based synthesis this problem is avoided by directly modelling the observed discontinuous log-F0 at a sub-segmental level by means of multi-space distributions (MSD) [2]. In an MSD, the log-F0 signal is assumed to be either a random variable sampled from a 1-dimensional distribution for voiced frames, or a 0-dimensional symbol for unvoiced ones. At synthesis time, the prior probability of these two spaces is used to classify each frame as voiced or unvoiced. A continuous F0 trajectory is then generated for each sub-section of voiced frames using the standard parameter generation algorithm [3]. In the original full HMM-based TTS [4] the predicted F0 had to be discontinuous because it was also used to control the pulse/noise switch excitation model. Nowadays, most HMM-based TTS systems use a more sophisticated excitation scheme in which the voicing does not depend only on the predicted F0 values. Those systems perform better when the voicing is controlled by the frequency-dependent soft decision provided by the excitation parameters rather than by the frequency-independent hard decision of the predicted discontinuous F0 [5]. Moreover, relieving F0 of any responsibility for voicing allows treating it as a continuous signal, thus improving the intonation model [6, 7, 8, 9].

* Kate Knill is currently at Cambridge University.

Another problem of the standard MSD model is that each voiced section is generated independently.
Supra-segmental structures are ignored or at most considered only implicitly via the decision tree used to select the models. A proposed method to generate F0 using explicit supra-segmental information consists of obtaining the log-F0 contour that maximizes the weighted sum of log-likelihoods of several intonation models, each at a different linguistic level [10, 11]. This approach can produce better intonation than a standard state-based MSD model [10, 11, 12]. The supra-segmental model consists of distributions of a fixed-order parametrisation of the log-F0 contour at that level. However, some parametrisations are unstable when applied to discontinuous signals. The standard way to deal with this problem is to interpolate log-F0 [13], usually with a linear or spline function using a window of one or two frames before and after the unvoiced gap. This causes two new problems. First, F0 values close to the unvoiced regions are often unreliable, so the values interpolated from them are unreliable too [7]. Second, interpolated values rarely follow the 'natural' contour of the data. As a result, they introduce a distortion in the coefficients which might affect the model [14].

It is possible to avoid this by computing the parametrisation only over continuous F0 sections [15, 11]. However, this makes building statistical models harder, because the meaning of the coefficients depends entirely on the underlying phonetic structure. For example, two phonetically different syllables, e.g., 'big' and 'pick', pronounced with the same intonation might have different coefficients because the boundaries of their voiced sections are different.

This paper investigates two approaches to obtain parametrisation coefficients over whole linguistic units without interpolating F0. The rest of the paper is organized as follows. Section 2 reviews the parametric F0 approach and its similarities to other continuous F0 models.
Section 3 introduces the two proposed methods to avoid interpolation. Section 4 shows the results of a subjective experiment. A possible explanation for these results is discussed in Section 5. Finally, conclusions are drawn in Section 6.
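To make the interpolation problem discussed above concrete, the following is a minimal sketch of the baseline pipeline: linearly filling log-F0 over an unvoiced gap and then taking a fixed-order DCT of a syllable's contour. All function names, the gap position, and the unnormalised DCT-II convention are illustrative assumptions, not the authors' implementation. The example shows that coefficients computed from the interpolated contour differ from those of the underlying smooth contour, which is the distortion the paper attributes to interpolation.

```python
import numpy as np

def interpolate_unvoiced(logf0, voiced):
    """Linearly fill log-F0 over unvoiced frames (the standard fix [13])."""
    t = np.arange(len(logf0))
    return np.interp(t, t[voiced], logf0[voiced])

def dct_coeffs(contour, order):
    """First `order` DCT-II coefficients of a contour (illustrative scaling)."""
    n = len(contour)
    k = np.arange(order)[:, None]                   # coefficient indices 0..order-1
    basis = np.cos(np.pi * k * (np.arange(n) + 0.5) / n)
    return (2.0 / n) * (basis @ contour)

# A smooth 'natural' syllable-length contour with an unvoiced gap in the middle.
n = 50
t = np.linspace(0.0, 1.0, n)
natural = 5.0 + 0.1 * np.sin(2.0 * np.pi * t)       # hypothetical log-F0 contour
voiced = np.ones(n, dtype=bool)
voiced[20:35] = False                               # F0 unobservable in the gap

filled = interpolate_unvoiced(natural, voiced)
c_natural = dct_coeffs(natural, 5)
c_interp = dct_coeffs(filled, 5)

# The linear fill does not follow the natural contour inside the gap,
# so the two coefficient vectors differ: this is the distortion that
# the proposed methods are designed to avoid.
print(np.abs(c_natural - c_interp))
```

Note that the voiced frames themselves are unchanged by the fill; the coefficient distortion comes entirely from the values substituted into the unvoiced region, which is why restricting the modelling to observed frames (as in Section 3) removes it.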