ANALYSIS OF SESOTHO TONE USING THE FUJISAKI MODEL

Lehlohonolo Mohasi 1, Hansjörg Mixdorff 2, Thomas Niesler 1, Sabine Zerbian 3

1 University of Stellenbosch, South Africa; 2 Beuth University of Applied Sciences Berlin, Germany; 3 University of Potsdam, Germany

lmohasi@sun.ac.za; mixdorff@beuth-hochschule.de; trn@sun.ac.za; szerbian@uni-potsdam.de

Abstract

In this paper, two approaches that can be used to determine the tonal pattern of sentences in Sesotho are compared: surface tone transcription and the Fujisaki model. The tone commands of the latter technique, which represent high tones, are compared with the high surface tones predicted by the tone rules. The mismatched syllables are investigated in order to account for the discrepancies, and particular attention is given to the influence of the adjacent syllables on the tone of the target syllable. Results reveal that the discrepancies are in many cases due to minor errors in the Fujisaki model such as the effects of microprosody, or are due to inconsistent surface tone prediction. An investigation into the prosodic groups formed by the tone commands found that these sequences are mainly due to two or more adjacent syllables with high tone labels, and sometimes due to alternating tone labels between the adjacent syllables.

Index Terms: Fujisaki model, Sesotho tone, Sesotho TTS

1. Introduction

In order for text-to-speech (TTS) systems to produce intelligible and natural-sounding speech, accurate prosodic modelling is crucial. Prosodic features include the fundamental frequency (F0) contour, duration, pause and amplitude. Tone, on the other hand, is a linguistic property marked by prosodic features such as F0 and intensity. Due to the absence of prosodic marking in the written format, prosodic modelling is a challenge for tonal Bantu languages such as Sesotho [1]. In this paper, we investigate and compare two methods by means of which the tonal pattern of sentences in Sesotho can be determined.
The first method is referred to as surface tonal transcription. This method is a tone labelling algorithm based on underlying (lexical) tones, as well as a set of tonal rules, a pronunciation dictionary and morphological analysis. The tonal rules applied are those described in the literature by Khoali [2]. The second method employs the Fujisaki model [3] and relies on the acoustics of the uttered speech. The Fujisaki model is a manageable and powerful model for prosody manipulation. It has proved remarkably effective in modelling fundamental frequency (F0) contours, and its validity has been tested for several languages [4, 5, 6, 7], including tonal languages such as Mandarin [8] and Thai [9].

The Fujisaki model decomposes the F0 contour extracted from the audio samples into three components: a base frequency; a phrase component, which captures the slower changes in the F0 contour associated with intonation phrases; and an accent component, which reflects the faster changes in F0 associated with high tones. The accent commands of the Fujisaki analysis, which for tone languages are usually referred to as tone commands, are an indicator of high tones in the utterance. Sesotho has two tones, high (H) and low (L), and previous work [10, 11] found that the Fujisaki model captures the high tones with tone commands of positive amplitude. For other tonal languages that have been investigated using this technique, such as Mandarin [8], Thai [9], and Vietnamese [12], low tones are captured by tone commands of negative polarity. In contrast, low tones in Sesotho were found to be associated with the absence of tone commands.

The objective of this paper is to investigate the relationship between the surface tone, which is computed using the lexical tones, the morphology and a set of tone rules known from the literature, and the tone commands as determined by the Fujisaki model.
We are particularly interested in how closely related the two predictions are, and how they compare with the perceived tone. The ultimate goal is to develop a technique that is able to predict the tone commands based on the surface tones. This will be an important step in the development of a computational model for tone, which will be essential in a Sesotho text-to-speech system.

Section 2 gives a brief background on tonal transcription and the Fujisaki model. Section 3 discusses the compilation of the corpus, the transcription of the surface and perceived tones, and the decomposition of the F0 contours into Fujisaki model parameters. Section 4 gives the results and analysis of the two cases being investigated, while Section 5 draws a conclusion from these results.

2. Background

Sesotho is classified as a grammatical tone language, which means that words may be pronounced with varying tonal patterns depending on their particular function in a sentence. In order to create certain grammatical constructs, tone rules may modify the underlying tones of the word and thus lead to differing surface tones. The underlying tone is the tonal pattern of the word in isolation and may be obtained from a tone-marked dictionary. The surface tone is derived from the underlying tone using tone rules, and is the tone given to a word when spoken as part of the sentence. We will indicate underlying high tones by underlining, and surface high tones by acute accents. For example, in the phrase Matsatsí á mabé li á látélańg (“The next two days”), the syllables tsí, á, bé and á have both underlying and surface high tones, while lá, té and ńg have surface high tones only.

Whereas the surface tone is determined using a set of tone rules, the Fujisaki model analyses the F0 contour of a natural utterance and decomposes it into a set of basic components which, together, yield an F0 contour that closely resembles the original.
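
To make the decomposition concrete, the following is a minimal Python sketch of how an F0 contour can be resynthesised from a base frequency, phrase commands and tone commands. It follows the commonly published formulation of the Fujisaki model, in which the components are summed in the log F0 domain. The function names, the constant values (ALPHA, BETA, GAMMA) and the example commands are illustrative assumptions and are not taken from this paper or its corpus.

```python
import math

# Assumed, commonly quoted constants; not values reported in this paper.
ALPHA = 2.0   # time constant of the phrase control mechanism (1/s)
BETA = 20.0   # time constant of the tone (accent) control mechanism (1/s)
GAMMA = 0.9   # ceiling of the tone component


def phrase_response(t, alpha=ALPHA):
    """Impulse response of the phrase component (zero before the command)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def tone_response(t, beta=BETA, gamma=GAMMA):
    """Step response of the tone component, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def log_f0(t, fb, phrase_cmds, tone_cmds):
    """Return ln F0 at time t (seconds).

    fb          -- base frequency in Hz
    phrase_cmds -- list of (onset_time, magnitude)
    tone_cmds   -- list of (onset_time, offset_time, amplitude)
    """
    value = math.log(fb)
    for t0, ap in phrase_cmds:
        value += ap * phrase_response(t - t0)
    for t1, t2, aa in tone_cmds:
        value += aa * (tone_response(t - t1) - tone_response(t - t2))
    return value


def f0_contour(times, fb, phrase_cmds, tone_cmds):
    """Return F0 in Hz for each time in `times`."""
    return [math.exp(log_f0(t, fb, phrase_cmds, tone_cmds)) for t in times]


if __name__ == "__main__":
    # Hypothetical utterance: one phrase command and one high-tone command.
    times = [i * 0.05 for i in range(21)]          # 0.0 s to 1.0 s
    contour = f0_contour(times, 100.0, [(0.0, 0.5)], [(0.30, 0.50, 0.4)])
    for t, f0 in zip(times, contour):
        print(f"{t:.2f} s  {f0:.1f} Hz")
```

In this sketch, a tone command of positive amplitude raises F0 over its span, which corresponds to a high tone; where no tone command is active, the contour follows the base frequency and the phrase component only, which corresponds to the low tones described above for the Fujisaki model.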
This method was first proposed by Fujisaki and his co-workers in the 70s and 80s [13] as an analytical model which describes fundamental frequency variations in human