How the slope of the speech spectrum affects the perception of speaker size

Kodai Yamamoto 1, Toshio Irino 1, Ryuichi Nisimura 1, Hideki Kawahara 1, Roy D. Patterson 2

1 Graduate School of Systems Engineering, Wakayama University
2 CNBH, Department of Physiology, Development, and Neuroscience, University of Cambridge

{s155051,irino,nisimura,kawahara}@sys.wakayama-u.ac.jp, rdp1@cam.ac.uk

Abstract

We performed a behavioral experiment to demonstrate the effect of spectral slope on the perception of speaker size, and we developed an auditory model based on the dynamic compressive gammachirp filterbank (dcGC-FB) to explain the results. STRAIGHT was used to generate “unvoiced” and “whispered” versions of naturally recorded words; the only difference was that the spectral slope of the whispered words was tilted up 6 dB/octave with respect to that of the unvoiced words. The experiment confirmed that the whispered words are heard to come from smaller speakers. The auditory model uses the tonotopic excitation pattern, Ep, as the internal representation of speech sounds. The model is found to be much more effective when the gradient of the excitation pattern, ∂Ep, is included in the size discrimination process. It is particularly useful for explaining the variability between individual subjects.

Index Terms: size perception, scale processing, excitation pattern, gammachirp auditory filterbank

1. Introduction

Some time ago, Irino and Patterson [1] showed how the auditory system might segregate the acoustic features in speech sounds associated with vocal tract shape from those associated with vocal tract length (VTL), and thereby produce an internal representation of speech sounds that is speaker-size invariant. Subsequently, the speech vocoder STRAIGHT [2, 3] was used to manipulate the VTL features of natural speech sounds and show that humans are very good at discriminating speaker size, using either voiced [4, 5] or unvoiced [6] speech. The experiments showed that the just noticeable difference (JND) in VTL for speaker size is about 7% for vowels [4] and about 5% for syllables or words; for comparison, the JND for loudness is about 11%. Two of these studies [4, 6] also showed that speech recognition performance was largely unaffected by speaker size, even when size was extended well beyond the normal range.

All of these size discrimination experiments [4, 5, 6] employed a simple two-interval, forced-choice (2IFC) paradigm. A short sequence of vowels, syllables, or words was chosen at random and presented in two voices that differed primarily in VTL; after listening to both sequences, the subject simply indicated which interval contained the speech of the smaller speaker. Over trials, the VTL difference was varied to determine the difference that supported 76% correct performance (defined to be the JND). The unvoiced speech sounds were produced from voiced speech recordings by substituting noise for the glottal pulses in the resynthesis stage of the vocoder. Two versions were produced: one using noise with a flat spectrum, and one in which the noise spectrum rose with frequency at 6 dB/octave. The latter version has more of the hiss we associate with whispered speech, so these stimuli were referred to as “whispered speech”; the stimuli resynthesized with the flat-spectrum noise were referred to as “unvoiced speech”. In all of the experiments, the mode of voicing (voiced, unvoiced or whispered) was the same in the two intervals of a trial, and the sound levels were equated in rms terms.
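As a concrete illustration of this stimulus manipulation, the sketch below applies a +6 dB/octave tilt to a signal in the frequency domain and then re-equates its rms level. It is a minimal stand-in, not the STRAIGHT resynthesis used for the actual stimuli; the function name and reference frequency are our own assumptions.

```python
import numpy as np

def tilt_6db_per_octave(x, fs, ref_hz=1000.0):
    """Apply a +6 dB/octave spectral tilt to signal x and restore its rms.

    Hypothetical frequency-domain stand-in for the 'whispered' condition;
    the actual stimuli were resynthesized with STRAIGHT.
    """
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), d=1.0 / fs)
    # +6 dB/octave means amplitude gain proportional to frequency
    # (gain doubles per octave); unity gain at the reference frequency.
    gain = np.maximum(f, f[1]) / ref_hz
    y = np.fft.irfft(X * gain, n=len(x))
    # Equate rms levels, as in the listening experiments.
    y *= np.sqrt(np.mean(x**2) / np.mean(y**2))
    return y

fs = 16000
noise = np.random.randn(fs)               # flat-spectrum ("unvoiced") carrier
whisper_like = tilt_6db_per_octave(noise, fs)  # tilted ("whispered") carrier
```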
Subsequently, informal listening suggested that the “whispered” speakers sounded somewhat smaller than the “unvoiced” speakers. For any given speech sound, the peak formant frequencies of the individual phonemes are the same in the whispered and unvoiced versions of an utterance, and such pairs are perceived to have the same linguistic content. However, if the items have the same rms level, the difference in spectral slope means that the lower formants of the whispered sounds have lower levels than the corresponding formants of the unvoiced version, and the higher formants of the whispered sounds have higher levels than the corresponding formants of the unvoiced version. The spectral centroid of “whispered” speech is thus higher than that of “unvoiced” speech. This suggests that the auditory system may simply be using the spectral centroid, or spectral slope, to distinguish speaker size.

In this paper, we report a size discrimination experiment designed to test this hypothesis. We measured psychometric functions and JNDs for size discrimination with whispered-whispered pairs, as in the previous study [6], and with whispered-unvoiced pairs as a new condition. To explain the results, we constructed a computational model of size discrimination based on the dynamic, compressive gammachirp (dcGC) auditory filterbank [7, 8, 9, 10]. The model simulates the subject’s decision process using the auditory spectra generated by the whispered and unvoiced speech sounds, and it illustrates the role of the slope of the speech spectrum in the perception of speaker size (a simplified sketch of the candidate decision statistics is given at the end of this section). The results show that both the excitation pattern (Ep) and its gradient (∂Ep) were necessary to explain the variability in JND between listeners. ∂Ep supports robust size estimation, although it has not commonly been used in conventional speech processing. The use of ∂Ep may well improve the robustness of automatic speaker identification and the control of speaker size in HMM synthesis.

2. Size discrimination experiment

We performed a size discrimination experiment similar to those of previous studies [4, 5, 6]. The main difference was the inclusion of a condition in which whispered and unvoiced speech sounds were compared directly within a trial. The speech sounds were words drawn from a database of Japanese four-mora words (FW03) [11] recorded from four native speakers of Japanese. The words in the database are controlled with respect to both word familiarity and phonetic balance, and they were spoken naturally. Four thousand words were categorized into
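To make the candidate decision statistics concrete, here is a toy sketch of the two cues discussed above: the centroid of the excitation pattern and its mean across-channel gradient ∂Ep. It operates on generic excitation patterns rather than dcGC-FB output, and the function names and simple decision rule are our own simplifications, not the model as implemented in the paper.

```python
import numpy as np

def ep_cues(ep_db):
    """Two candidate size cues from an excitation pattern Ep.

    ep_db: excitation level (dB) per filterbank channel, ordered from low
    to high center frequency (a generic stand-in for dcGC-FB output).
    Returns the amplitude-weighted centroid and the mean gradient ∂Ep.
    """
    chan = np.arange(len(ep_db))
    amp = 10.0 ** (ep_db / 20.0)           # dB -> linear amplitude
    centroid = np.sum(chan * amp) / np.sum(amp)
    grad = np.mean(np.gradient(ep_db))     # mean across-channel slope
    return centroid, grad

def choose_smaller(ep_a, ep_b, use_gradient=True):
    """Toy 2IFC decision: the interval with more high-frequency emphasis
    (higher centroid, or steeper upward slope) is judged the smaller speaker."""
    ca, ga = ep_cues(ep_a)
    cb, gb = ep_cues(ep_b)
    if use_gradient:
        return "A" if ga > gb else "B"
    return "A" if ca > cb else "B"
```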