Orofacial somatosensory inputs enhance speech intelligibility in noisy environments

Rintaro Ogane 1, Jean-Luc Schwartz 1, Takayuki Ito 1,2

1 Univ. Grenoble Alpes, CNRS, Grenoble INP*, GIPSA-lab, 38000 Grenoble, France
2 Haskins Laboratories, CT, USA
* Institute of Engineering Univ. Grenoble Alpes

{rintaro.ogane, jean-luc.schwartz, takayuki.ito}@gipsa-lab.grenoble-inp.fr

Abstract

Noise in speech communication reduces intelligibility and makes it more difficult for the listener to detect the talker's utterances. In such noisy environments, other sensory inputs that accompany the auditory input can help to increase speech sound intelligibility. For example, seeing the speaker's facial movements aids the perception of speech sounds in noise (Grant and Seitz, 2000; Kim and Davis, 2004; Sumby and Pollack, 1954). Recent findings have demonstrated that somatosensory information associated with facial skin deformation also plays a role in speech perception (Ito et al., 2009; Ogane et al., 2019, 2020). Although the effect of somatosensory stimulation has so far been assessed only in quiet environments, somatosensory inputs might also increase the intelligibility of speech sounds in noisy environments. The current experiment examined whether orofacial somatosensory inputs facilitate the detection of speech sounds in noise. We carried out a test to evaluate the detection threshold of speech sounds in noise and examined whether this threshold decreased when the sound was accompanied by somatosensory stimulation. Moreover, we examined whether somatosensory stimulation provides just a temporal cue for accurate detection or carries more specific articulatory information related to the auditory stimulation. To this end, we compared different types of auditory stimuli, varying in their articulatory compatibility with the somatosensory stimulation. Twenty-eight native French speakers participated in the experiment.
We focused on two French speech sounds, /pa/ and /py/, associated respectively with vertical (jaw opening) and horizontal (lip rounding) articulatory gestures. Both stimuli were recorded by a male native French speaker, and their intensity levels were adjusted to be equal. Each speech sound was tested in a separate group, and participants were randomly assigned to one of the two groups. During the test, a 1-s white noise sound was presented twice in sequence with an inter-stimulus interval of 250 ms. The speech stimulus (/pa/ or /py/ depending on the group) was embedded in one of the two noise intervals. Participants were asked to identify which noise interval included the speech sound by pressing a key as quickly as possible. The amplitude of the noise was fixed at 80 dB SPL. We tested eight signal-to-noise ratio (SNR) levels by modifying the amplitude of the target speech sound, from -8 dB to -15 dB for /pa/ and from -10 dB to -17 dB for /py/ (values selected after a pilot experiment). The onset of the speech sound in the corresponding noise interval was randomly set to either 200 or 600 ms after noise onset. The auditory stimulation was presented through headphones. Somatosensory stimulation associated with facial skin deformation was produced using a robotic device in a vertical direction with a 6 Hz half-sinusoidal pattern providing a 167 ms stimulation duration. The peak of the somatosensory stimulation was aligned in time with the peak amplitude of the target speech sound. The somatosensory stimulation was applied in both noise intervals, regardless of which interval contained the target speech sound. We tested two experimental conditions: an auditory-only condition and a condition with somatosensory stimulation. These two conditions were alternated every 8 trials. In total, 320 stimuli (eight SNR levels × 20 occurrences per SNR level × two experimental conditions) were presented in a pseudo-randomized order.
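The trial structure described above can be sketched as a schedule generator. This is only an illustrative sketch, not the authors' procedure. It assumes 1 dB steps between SNR levels, that each 8-trial block presents every SNR level once in shuffled order, and that the target interval is drawn uniformly at random. The function name `build_schedule`, the condition labels, and the dictionary keys are hypothetical.

```python
import random

# SNR ranges from the text; the 1 dB spacing is an assumption.
SNR_LEVELS_DB = {
    "pa": list(range(-8, -16, -1)),   # -8 ... -15 dB
    "py": list(range(-10, -18, -1)),  # -10 ... -17 dB
}
CONDITIONS = ("auditory", "auditory_somatosensory")  # hypothetical labels
REPS_PER_SNR = 20            # occurrences per SNR level and condition
SPEECH_ONSETS_MS = (200, 600)  # speech onset relative to noise onset


def build_schedule(syllable, seed=0):
    """Return a list of trial dicts for one participant (hypothetical layout)."""
    rng = random.Random(seed)
    levels = SNR_LEVELS_DB[syllable]
    n_blocks = REPS_PER_SNR * len(CONDITIONS)  # 40 blocks of 8 trials
    trials = []
    for block in range(n_blocks):
        # The two conditions alternate from one 8-trial block to the next.
        condition = CONDITIONS[block % len(CONDITIONS)]
        # Assumption: every SNR level appears once per block, shuffled.
        for snr in rng.sample(levels, k=len(levels)):
            trials.append({
                "block": block,
                "condition": condition,
                "snr_db": snr,
                # Which of the two noise intervals holds the speech sound.
                "target_interval": rng.choice((1, 2)),
                "speech_onset_ms": rng.choice(SPEECH_ONSETS_MS),
            })
    return trials


schedule = build_schedule("pa")
print(len(schedule))  # 8 SNR levels x 20 repetitions x 2 conditions
```

Under these assumptions the schedule reproduces the counts given in the text: 320 trials, with each SNR level occurring 20 times in each condition.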
For data analysis, the percentage of correct detection responses was computed at each SNR level. We compared the average correct detection score across SNR levels between the two experimental conditions. A one-way repeated-measures ANOVA was applied to each participant group separately because the two groups displayed clearly different variances. For /pa/, there was a significant difference of 3 percentage points between the auditory-only and auditory-somatosensory conditions (mean ± standard error: 0.73 ± 0.01 vs. 0.76 ± 0.01; F(1, 13) = 5.44, p < 0.04). On