Speaker-specific structure in German voiceless stop voice onset times

Marc A. Hullebus 1, Stephen J. Tobin 1,2, Adamantios I. Gafos 1,2

1 Department für Linguistik, Universität Potsdam, Germany
2 Haskins Laboratories, New Haven, USA

hullebus@uni-potsdam.de, tobin@uni-potsdam.de, gafos@uni-potsdam.de

Abstract

Voice onset time (VOT), a primary cue for voicing in many languages including English and German, is known to vary greatly between speakers, but also displays robust within-speaker consistencies, at least in English. The current analysis extends these findings to German. VOT measures were investigated from voiceless alveolar and velar stops in CV syllables cued by a visual prompt in a cue-distractor task. As in English, a considerable portion of German VOT variability can be attributed to the syllable’s vowel length and the stop’s place of articulation. Individual differences in VOT nevertheless remain irrespective of speech rate. However, significant correlations across places of articulation and between speaker-specific mean VOTs and standard deviations indicate that talkers employ a relatively unified VOT profile across places of articulation. This could allow listeners to adapt more efficiently to speaker-specific realisations.

Index Terms: speech production, speech variability, voice onset time

1. Introduction

Our perceptual systems manage to categorise speech sounds with relative ease, despite the seemingly unstructured variability in natural speech production. Many characteristics of individual speech are paralinguistic and serve to facilitate talker identification. However, even acoustic parameters that are indicative of contrastive speech categories vary to a remarkable degree both across and within speakers. Spectral characteristics such as formant frequencies have been shown to vary irrespective of phonemic context and overall pitch [1].
Durational parameters, which specify differences in speech properties relating to time, are not immune to variability either. One of the most extensively studied durational parameters conveying linguistic contrasts is voice onset time (VOT) [2]. In many languages, VOT serves as a primary cue for the voicing contrast in stops (e.g. voiced versus voiceless). VOT is defined as the interval, usually given in milliseconds, between the burst release of the stop and the onset of periodicity indicating vocal fold vibration for the vowel. The relative timing of these two events is crucial for distinguishing stop categories, with longer VOT values mostly associated with voiceless stops and shorter or negative VOTs with voiced stops. However, VOT values can differ widely between speakers [3], with reported mean VOTs for an English velar stop, for instance, ranging from 52 to 80 ms across studies [4], [5]. Voiceless stops in particular tend to display more variability [1], which could be attributed to the lack of a contrastive category boundary in the high VOT range, as opposed to the presence of a lower boundary where the threshold for the voiced category lies. Another reason might be that the reduced synchronicity of oral and laryngeal gestures that accompanies longer aspiration brings greater variability [6].

Despite this variability, VOT is also known to be influenced systematically by a number of well-known parameters. Taking these into account can reduce some of the apparent randomness of the variability and expose some of its underlying structure. A well-known effect is that of place of articulation (PoA), with VOT generally increasing as the PoA moves further back in the vocal tract [5]. Another variable that greatly influences VOT, along with other durational properties and many other speech characteristics, is speech rate [7]. For sentences, speech rate is often measured as the number of syllables per second; for single syllables, the duration of the vowel can serve as the measure.
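The VOT definition and the two rate measures described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the measurement procedure used in this study: the function names and the example landmark times are hypothetical, and the landmark times are assumed to come from hand or automatic annotation in seconds.

```python
def voice_onset_time_ms(burst_release_s: float, voicing_onset_s: float) -> float:
    """Return VOT in milliseconds.

    VOT is the time from the burst release to the onset of periodicity.
    A negative value means voicing started before the release (prevoicing).
    """
    return (voicing_onset_s - burst_release_s) * 1000.0


def syllables_per_second(n_syllables: int, utterance_duration_s: float) -> float:
    """Sentence-level speech rate as syllables per second."""
    if utterance_duration_s <= 0:
        raise ValueError("utterance duration must be positive")
    return n_syllables / utterance_duration_s


def vowel_duration_ms(vowel_onset_s: float, vowel_offset_s: float) -> float:
    """Vowel duration in milliseconds, a rate proxy for single syllables."""
    return (vowel_offset_s - vowel_onset_s) * 1000.0


if __name__ == "__main__":
    # Hypothetical landmark times (seconds) for one CV token.
    burst, voicing, vowel_end = 0.5, 0.5625, 0.75
    print(voice_onset_time_ms(burst, voicing))   # positive: voicing lags the release
    print(vowel_duration_ms(voicing, vowel_end))
    print(syllables_per_second(12, 3.0))
```

Keeping the sign of the difference, rather than taking an absolute value, preserves the distinction between positive (lag) and negative (lead) VOT mentioned in the definition.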
As more rapid speech leads to a compressed time frame, voice onset times will necessarily be shortened [7]. Another possible factor is hyperarticulation, a phenomenon that is hard to exclude in experimental settings. Hyperarticulation can result in longer speech planning times, which in turn can translate into longer aspiration and consequently longer voice onset times [8]. Further potential influences are prosodic context [9] and sex differences [10].

Nevertheless, even when factors such as speaking rate are taken into account, much individual speaker variability remains [11], [12]. For instance, the PoA effect is not clear-cut for alveolar and velar stops [13]. Although both have consistently longer VOTs than bilabial stops, alveolars sometimes rank higher than velars in VOT length. As much as speakers differ from one another in their VOT productions, there are indications that each speaker behaves relatively consistently across his or her own stop categories. Previous studies such as those of Zlatin [14] and Koenig [15] demonstrated significant correlations among VOT distributions of stop categories at different PoAs, suggesting that stops across different places of articulation are produced in similar ways by single speakers.

Although speakers might produce relatively consistent VOTs across stop categories, these VOTs can also be influenced by situational factors, such as the speech of their interlocutor. Several studies have found that speakers adjust their speech production in response to the phonetic realisation of speech that they hear. These adjustments, called “phonetic accommodation”, have been observed in different phonetic parameters, including VOT.
Aside from the effects of long-term exposure to different VOT ranges [16], [17], speakers tend to produce longer VOTs shortly after hearing words with longer VOTs, and even generalise the lengthening to PoAs different from those of the words they have just heard [17]. The size of this effect has often been found to be on the order of 5 ms [6], [16], [17]. Newer evidence suggests that interactions between perception and production can occur

Interspeech 2018, 2-6 September 2018, Hyderabad, p. 1403. DOI: 10.21437/Interspeech.2018-2288