GSLT: Speech Technology 1, Susanne Schötz, Term paper, Autumn 2002

Linguistic & Paralinguistic Phonetic Variation in Speaker Recognition & Text-to-Speech Synthesis

Susanne Schötz (susanne.schotz@ling.lu.se)
Department of Linguistics and Phonetics, Lund University

ABSTRACT

Phonetic variation, and especially prosodic variation, which is often paralinguistic in nature, has gradually attracted more attention among speech researchers as one possible solution to problems with automatic speaker recognition (ASrR) and text-to-speech synthesis (TTS) systems. This paper presents a brief overview of approaches to phonetic variation in ASrR and TTS, beginning with attempts to classify linguistic and paralinguistic phenomena in speech. Some of the problems related to paralinguistic phonetic variation, and the solutions that have been attempted, are also discussed.

1 Introduction

One of the major obstacles to improving existing speaker recognition and text-to-speech systems is related to prosody. Today's prosody models are still far from perfect, and because paralinguistic information in speech is signalled mainly with prosodic cues, today's systems are also unable to recognize and generate speaker-specific qualities such as age, sex, emotions and attitudes effectively and reliably.

Another problem for speech researchers is the confusing terminology associated with prosody. Linguists have made a distinction between linguistic and paralinguistic information, while phoneticians have traditionally preferred to draw the line between segments (vowels and consonants) and prosody.

This paper presents and discusses some of the approaches to the problems associated with linguistic, prosodic and paralinguistic phonetic variation. Of the following sections, 3.1 and 4.1 are based mainly on Furui (1996, 1997) and Gish & Schmidt (1994), while sections 3.2 and 4.2 are based on Dutoit (1997), Klatt (1987) and Carlson & Granström (1997).
2 Classifications of phonetic variation

There are a number of ways to classify phonetic variation in speech, and there is not always agreement on what the categories are or which aspects each should include. Different typologies are created for different purposes, but categories often overlap, and several features may belong to more than one category. This section presents some of the classes that have been used to describe phonetic and paralinguistic variation, and provides examples of the various categories.

2.1 Linguistic or paralinguistic

A distinction typically made by linguists and many speech researchers divides speech into linguistic information, i.e. the arbitrary language code used intentionally by the speaker for communication, on the one hand, and all other information on the other. Speech signals necessarily contain information besides the linguistic. Such information varies as a function of the speaker, the listener(s) and the communicative situation, and is referred to in the literature as paralinguistic, extra-linguistic or non-linguistic. In Saussure’s terminology, paralinguistic phenomena would belong to ‘parole’ rather than ‘langue’ (Traunmüller 2001).

Roach et al. (1998) define paralinguistic features as those used intentionally by the speaker, and non-linguistic features as those that cannot be used intentionally, such as age, sex and state of health. Non-linguistic features are further classified into two groups. The first is individual variation, which is due to the physiology (size, weight) and histology (age) of the vocal tract and affects the phonation and resonance of the speech. The second is reflexes, which are involuntary reactions to an emotional state and include clearing the throat, sniffs, yawns, laughs, cries and audible breathing.