FUNDAMENTAL FREQUENCY MODELING FOR CORPUS-BASED SPEECH SYNTHESIS BASED ON A STATISTICAL LEARNING TECHNIQUE

Shinsuke Sakai and James Glass

Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Cambridge, MA 02139, USA
{sakai,glass}@mit.edu

ABSTRACT

This paper proposes a novel two-layer approach to fundamental frequency modeling for concatenative speech synthesis based on a statistical learning technique called additive models. We define an additive $F_0$ contour model consisting of a long-term, intonational phrase-level component and a short-term, accentual phrase-level component, along with a least-squares error criterion that includes a regularization term. A backfitting algorithm derived from this error criterion estimates both components simultaneously by iteratively applying cubic spline smoothers. When this method is applied to a 7,000-utterance Japanese speech corpus, it achieves $F_0$ RMS errors of 28.9 and 29.8 Hz on the training and test data, respectively, with corresponding correlation coefficients of 0.81 and 0.77. The automatically determined intonational and accentual phrase components behave smoothly, systematically, and intuitively under a variety of prosodic conditions.

1. INTRODUCTION

In recent years, corpus-based concatenative methods for speech synthesis have received increasing attention within the research community as well as the speech technology industry, because of their ability to generate natural-sounding speech output [1, 2]. In general, for synthesized speech to be natural and intelligible, it is crucial to have a proper $F_0$ contour that is compatible with linguistic information such as lexical accent (or stress) and phrasing in the input text. In the corpus-based concatenative speech synthesis setting, target $F_0$ features (e.g., mean frequency, dynamic range) are generated for each synthesis unit.
Distance metrics can then be used to compute a cost between the unit target values and those available in a speech corpus. The overall cost is minimized during search to find the best-matching sequence of synthesis units from the corpus. In some systems, the $F_0$ target is predicted by an independent rule-based front-end [3], while regression tree-based approaches are often used to predict $F_0$-related measures from a set of linguistic features [4, 5]. A regression tree approach is advantageous in that it is simple to implement yet powerful. It has a few drawbacks, however. For example, the predicted values do not form a smooth contour, since a regression tree essentially represents a piecewise-constant function of the input features.

In this work, we propose a simple yet novel two-layer additive model [6, 7] approach to $F_0$ contour prediction, and a method to estimate the component functions through the minimization of a residual sum-of-squares error criterion that includes a regularization term. In the following section we define the additive $F_0$ model, along with the penalized least-squares criterion from which a backfitting algorithm is derived as the minimizer of the criterion. We then describe experimental results applying the proposed method to a large corpus of Japanese speech.

This research was supported in part by NTT.

2. ADDITIVE MODEL APPROACH

The basic formulation for the $F_0$ contour is similar to previous work, e.g., [8, 9]. In this approach, the $F_0$ contour, $Y$, is the output of a statistical model that combines a long-range, intonational phrase-level component, $g$, and a shorter, accentual phrase-level component, $h$:

$$Y = \alpha + g(I, U) + h(A, V) + \epsilon = \alpha + g_I(U) + h_A(V) + \epsilon, \qquad (1)$$

where $\alpha$ is a constant and $I$ is a discrete-valued (i.e., symbolic) input variable that represents a type of intonational phrase and indexes the relevant function $g_I$. $U$ is a continuous variable representing a time point relative to the starting point of the phrase of type $I$.
Similarly, the discrete variable $A$ designates a type of accentual phrase, and $V$ represents a time point relative to the starting point of the accentual phrase of type $A$. The random error term, $\epsilon$, has zero mean. Figure 1 shows how the three terms form the entire $F_0$ contour function.

In Proceedings of IEEE ASRU 2003, Nov. 30-Dec. 4, 2003, St. Thomas, U.S. Virgin Islands, pp. 712–717. © IEEE 2003
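As a rough illustration of how the terms in Eq. (1) combine, the following minimal sketch evaluates the mean of the additive model, $\alpha + g_I(U) + h_A(V)$, for a single time point. Everything specific in it is an assumption made for illustration only: the phrase-type labels, the shapes and magnitudes of the component functions, the use of seconds for $U$ and $V$, and the name `predict_f0` do not come from the paper, which estimates $g_I$ and $h_A$ from data with cubic spline smoothers.

```python
import math

# Hypothetical component functions, chosen only for illustration.
# They are NOT the estimated components from the paper.

# g_I(U): one function per intonational phrase type I,
# evaluated at time U (assumed here to be in seconds from the phrase start).
G = {
    "ip_declining": lambda u: -20.0 * u,  # slow linear decline over the phrase
    "ip_flat": lambda u: 0.0,             # no long-term movement
}

# h_A(V): one function per accentual phrase type A,
# evaluated at time V (assumed seconds from the accentual phrase start).
H = {
    "ap_early_peak": lambda v: 30.0 * math.exp(-((v - 0.2) ** 2) / 0.02),
    "ap_late_peak": lambda v: 30.0 * math.exp(-((v - 0.5) ** 2) / 0.02),
}


def predict_f0(alpha, ip_type, u, ap_type, v):
    """Return alpha + g_I(u) + h_A(v), i.e. Eq. (1) without the error term.

    The discrete inputs (ip_type, ap_type) only select which component
    function is used; the continuous inputs (u, v) are the arguments
    passed to the selected functions.
    """
    return alpha + G[ip_type](u) + H[ap_type](v)


# Example: a point 1.0 s into the intonational phrase and
# 0.2 s into the accentual phrase, with a constant of 200 Hz.
f0 = predict_f0(200.0, "ip_declining", 1.0, "ap_early_peak", 0.2)
```

Because the model is additive, changing the accentual phrase type alters only the $h_A(V)$ term, while the long-term contribution $g_I(U)$ at the same time point is unchanged.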