Cognition 211 (2021) 104619
0010-0277/© 2021 Elsevier B.V. All rights reserved.
Encoding and decoding of meaning through structured variability in
intonational speech prosody
Xin Xie a,1,*, Andrés Buxó-Lugo b,1,*, Chigusa Kurumada a
a Department of Brain and Cognitive Sciences, University of Rochester, Rochester, NY 14627, USA
b Department of Psychology, University of Maryland, College Park, MD 20742, USA
ARTICLE INFO
Keywords:
Prosody
Meaning
Intonation
Adaptation
Language production
Language comprehension
Variability
ABSTRACT
Speech prosody plays an important role in communication of meaning. The cognitive and computational
mechanisms supporting this communication remain to be understood, however. Prosodic cues vary across talkers
and speaking conditions, creating ambiguity in the sound-to-meaning mapping. We hypothesize that listeners
ameliorate this ambiguity in part by learning talker-specific statistics of prosodic cues. To test this hypothesis, we
investigate the production and recognition of question vs. statement prosody in American English. Experiment 1
elicits productions of questions and statements from 65 talkers to examine the distributional statistics charac-
terizing within- and cross-talker variability in these productions. We use Bayesian ideal observer models to assess
the predicted consequences of cross-talker variability on listeners’ recognition of prosody. We find that learning
of talker-specific distributional statistics is predicted to facilitate recognition, above and beyond what can be
achieved via commonly assumed normalizations of prosodic cues. Experiment 2 tests this prediction in a
comprehension experiment. We expose different groups of listeners to different prosodic input statistics and
assess listeners’ recognition of questions and statements both prior to, and following, exposure. Prior to exposure,
ideal observer-derived predictions based on Experiment 1 provide a good qualitative fit against listeners’
recognition of prosodic contours in Experiment 2. Following exposure, listeners shift the categorization boundary
between questions and statements in ways consistent with learning of talker-specific statistics.
1. Introduction
Prosody—the rhythm and cadence of speech—plays a critical role in
the communication of meaning. Subtle differences in utterance-final
intonation contours, for instance, change an utterance’s meaning from
a statement (e.g., It’s raining. [falling intonation]) to a question (e.g., It’s
raining? [rising intonation]). There is a rich evidence base indicating
that listeners recognize such meaning-distinguishing prosodic categories
(Bolinger, 1989; Gussenhoven, 2002; Ladd, 2008; Pierrehumbert and
Hirschberg, 1990) and integrate the meaning as an utterance
unfolds (Cutler, 2015; Dahan, 2015; Ito and Speer, 2008; Weber
et al., 2006). However, the cognitive and perceptual mechanisms sup-
porting this recognition remain poorly understood.
One major source of difficulty stems from variability in the prosodic
signal across talkers and contexts (Arvaniti, 2019; Brugos et al., 2006;
Cangemi et al., 2015; Cangemi and Grice, 2016; Cole, 2015). Continuing
with the case of statements vs. questions in American English, the exact
form and level of the rise produced to signal a question meaning can
vary across talkers as well as talker groups (e.g., age, gender, dialect)
(Arvaniti and Garding, 2007; Clopper and Smiljanic, 2011). For
example, due to difficulties in controlling their pitch, young children
tend to produce a smaller rise than older children (Patel and
Grigos, 2006). Also, rising intonation can be used to signal other,
including social, meanings (e.g., ‘uptalk’, Warren, 2016). As a result of
this talker variability, one person’s production of a statement and
another person’s production of a question can be phonetically identical.
The present study explores how listeners may navigate this “lack of
invariance” in the realization of prosody. Although talker variability in
speech acoustics has been an issue central to speech perception research
(e.g., Hillenbrand et al., 1995; Newman et al., 2001; Theodore et al.,
2009), relevant accounts for how listeners may cope with the variability
focus almost exclusively on segmental (as opposed to prosodic) speech
* Corresponding authors at: Department of Brain and Cognitive Sciences, Meliora Hall, University of Rochester, Rochester, NY 14627, and Department of Psy-
chology, University of Maryland, Biology/Psychology Building, 4094 Campus Dr., College Park, MD 20742, United States.
E-mail addresses: xxie13@ur.rochester.edu (X. Xie), buxolugo@umd.edu (A. Buxó-Lugo).
1 The first two authors contributed equally.
https://doi.org/10.1016/j.cognition.2021.104619
Received 1 August 2020; Received in revised form 25 November 2020; Accepted 27 January 2021