Acoustic distance explains speaker versus accent normalization in infancy

Paola Escudero, Karen E. Mulak & Samra Alispahic

The MARCS Institute, University of Western Sydney
paola.escudero@uws.edu.au, k.mulak@uws.edu.au, s.alispahis@uws.edu.au

Abstract

Acoustic/phonetic differences exist in cross-speaker and cross-accent speech. Young infants generally recognize speech across speakers, but not across speakers of different accents. We examined how Australian English infants discriminated Dutch vowels produced by two speakers of the same accent, and by two speakers of two different accents. Acoustic analysis showed that the acoustic distance between same-vowel tokens produced by speakers of different accents was larger than that between tokens produced by speakers of the same accent. Infants showed a greater difference in looking time to an accent change than to a speaker change, indicating that they noticed a vowel produced in a different accent more than one produced by another speaker of the same accent. This supports the hypothesis that acoustic distance underlies the relative ease of handling speaker versus accent variation.

Index Terms: speaker variation, accent variation, speech normalization, language development, vowel perception

1. Introduction

An utterance produced by two different people will vary considerably on acoustic/phonetic dimensions, due to individual differences in vocal tract characteristics. This variation between pronunciations is further increased when the two speakers speak different regional accents of a language, as regional accents typically contain systematic variation in pronunciation. From experience, we know that adults readily understand speech across both the variability among speakers of the same regional accent and the variability observed in regionally accented speech. But are these two types of variation handled to the same extent?
Recent studies on infant populations have found that, like adults, young infants aged 7.5-9 months are able to recognize words across pronunciations by speakers with similar [1] and dissimilar [2], [3] (but see [1]) voices. This ability also extends to the recognition of syllables [4], [5] and vowels [6] across speakers. Notably, chinchillas, budgerigars, and zebra finches have also been found to recognize vowels and words across speakers [7]-[11], suggesting that the ability to normalize speech across speakers and ascertain the relevant linguistic information may be innate.

Unlike adults, young infants are unable to recognize speech produced across accents. For instance, American infants familiarized to words produced by a speaker of American English did not recognize those words in passages produced by a Spanish-accented speaker of English [2] or a speaker of Canadian English [12], or vice versa. Indeed, even adults' ability to recognize speech across accents is subject to exposure. Adults were able to recognize words in Spanish- and Chinese-accented English as quickly as in non-accented English after receiving a short exposure to the accent [13], and different amounts of exposure are needed for successful processing of an accent depending on the accent and context [14]. With respect to vowels, an artificial regional accent study showed that adults classified "wetch" as "witch" after 20 minutes of exposure to dialog that included the vowel shift of /ɪ/ to /ɛ/ [15], but did not make this classification without exposure.

Thus, it seems that without previous exposure, young infants, adults, and non-human animals are able to recognize the invariant linguistic information of speech sounds even when they are produced by different speakers. This suggests that the ability is pre-linguistic and innate. By contrast, the ability to recognize speech across accents is not present in young infants, and adults require pre-exposure to words in the new accent.
This suggests that accent normalization is an emergent ability that operates at the lexical level, as it requires lexical exposure. This lexical tie to resolving accent variation is also evident in older infants: Nineteen-month-olds exposed to words containing a vowel shift (e.g., "dog" shifted to "dag") subsequently treated the shifted pronunciation as a valid label for an associated visual referent (i.e., accepted "dag" as a label for a picture of a dog; [16]). Likewise, 15-month-olds' ability to recognize words across accents positively correlates with their expressive vocabulary score, further demonstrating the lexical link [17].

But why do resolving accent variation and resolving speaker variation appear to be handled in different ways? It may be that our auditory system (and that of other vertebrates) is able to factor out the absolute physical differences between the productions of different speakers. In that respect, it has been shown that while there is considerable variation between the absolute F1 and F2 values of an individual vowel when produced by different speakers, this variation largely disappears when F1 and F2 are divided by higher formants such as F3 [18]. It may be that humans and other vertebrates attend to these ratios rather than to absolute formant values when discriminating speech sounds, enabling rapid speaker normalization. Magnetoencephalographic studies have shown sensitivity in the auditory cortex to F1/F3 at latencies consistent with low-level processing (between feature extraction and abstract processing; [19]). Thus, there exists a plausible mechanism for rapid or even automatic speaker normalization within an accent. Alternatively, it may simply be that the acoustic variation across the productions of different speakers of the same accent is smaller than that across speakers of different accents.
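The formant-ratio account can be illustrated with a small numerical sketch. The formant values below are hypothetical, and the uniform vocal-tract scaling is a simplifying assumption for illustration only, not the analysis reported in this paper:

```python
def formant_ratios(f1, f2, f3):
    """Return (F1/F3, F2/F3) for one vowel token, rounded to 3 decimals."""
    return (round(f1 / f3, 3), round(f2 / f3, 3))

# Speaker A: hypothetical formants (Hz) for an /i/-like vowel.
a = (270, 2290, 3010)

# Speaker B: the same vowel from a shorter vocal tract, modeled here
# (simplifyingly) as a uniform 18% upward scaling of all formants.
b = tuple(round(f * 1.18) for f in a)

# Absolute formants differ substantially across the two speakers,
# but the formant ratios coincide, as in the account of [18]:
print(formant_ratios(*a))  # (0.09, 0.761)
print(formant_ratios(*b))  # (0.09, 0.761)
```

Under uniform scaling, the ratios are invariant by construction; real cross-speaker variation is not perfectly uniform, so ratios reduce, rather than eliminate, the differences.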
In that respect, many cross-accent studies have shown that variability in the production of the same vowels across accents influences cross-accent, non-native, and second-language vowel perception (for a recent review, see the introduction and discussion in [20]).
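The notion of acoustic distance can be operationalized in several ways; one minimal sketch uses Euclidean distance in the F1xF2 plane. The formant values below are hypothetical, chosen only to illustrate the pattern reported in the acoustic analysis (larger cross-accent than within-accent distances), and are not the Dutch vowel measurements from this study:

```python
import math

def acoustic_distance(token_a, token_b):
    """Euclidean distance between two (F1, F2) vowel tokens in Hz."""
    return math.dist(token_a, token_b)

# Hypothetical (F1, F2) values for one vowel category:
same_accent_a = (440, 1800)  # speaker 1, accent X
same_accent_b = (470, 1750)  # speaker 2, accent X
other_accent  = (540, 1600)  # speaker 3, accent Y

within = acoustic_distance(same_accent_a, same_accent_b)  # ~58 Hz
across = acoustic_distance(same_accent_a, other_accent)   # ~224 Hz
print(within < across)  # True: cross-accent tokens lie farther apart
```

With real data, distances would typically be computed over many tokens per speaker and vowel, and often in a perceptual scale (e.g., Bark or ERB) rather than raw Hz.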