Using Voice Quality Features to Improve Short-Utterance, Text-Independent Speaker Verification Systems

Soo Jin Park 1, Gary Yeung 1, Jody Kreiman 2, Patricia A. Keating 3, and Abeer Alwan 1

1 Dept. of Electrical Engineering, University of California Los Angeles, USA
2 Dept. of Head and Neck Surgery, School of Medicine, University of California Los Angeles, USA
3 Dept. of Linguistics, University of California Los Angeles, USA

sj.park@ucla.edu, garyyeung@g.ucla.edu, jkreiman@ucla.edu, keating@humnet.ucla.edu, alwan@ee.ucla.edu

Abstract

Due to within-speaker variability in phonetic content and/or speaking style, the performance of automatic speaker verification (ASV) systems degrades, especially when the enrollment and test utterances are short. This study examines how different types of variability influence the performance of ASV systems. Speech samples (< 2 sec) from the UCLA Speaker Variability Database containing 5 different read sentences by 200 speakers were used to study content variability. Other samples (about 5 sec) that contained speech directed towards pets, characterized by exaggerated prosody, were used to analyze style variability. Using the i-vector/PLDA framework, the ASV system error rate with MFCCs had a relative increase of at least 265% and 730% in content-mismatched and style-mismatched trials, respectively. A set of features that represents voice quality (F0, F1, F2, F3, H1-H2, H2-H4, H4-H2k, A1, A2, A3, and CPP) was also used. With score fusion of these features and MFCCs, error rates decreased in all conditions. In addition, on the NIST SRE10 database, score fusion provided relative improvements of 11.78% for 5-second utterances, 12.41% for 10-second utterances, and a small improvement for long utterances (about 5 min). These results suggest that voice quality features can improve short-utterance, text-independent ASV system performance.

Index Terms: speaker recognition, within-speaker variability, voice quality

1. Introduction

A single speaker’s voice can vary dramatically in different situations. Word choices, mood, intentions, health conditions, and the relationship to the listener all affect the acoustic characteristics of that person’s voice. Such within-speaker variability causes major difficulties when identifying speakers from their voices. The problem becomes critical when the utterances used to enroll and verify speakers are short. For instance, the equal error rate (EER) for text-independent automatic speaker verification (ASV) is 1.59–2.48% for 2-minute utterances, while the EER skyrockets to 10.52–21.83% for 5-second utterances [1, 2]. A possible interpretation of this phenomenon is that shorter utterances cannot capture all the variability in a speaker’s voice. This within-speaker variability falls into two categories: extrinsic variability and intrinsic variability [3]. Extrinsic variability is variability outside the speaker’s control, such as recording conditions, channel types, and noise. Intrinsic variability is variability that characterizes the speaker’s voice itself, such as word choice, articulation, emotion, and speaking style. Although extrinsic variability also affects system performance, we focus on intrinsic variability in this study. We are most interested in finding speaker-characterizing features that are robust to intrinsic variability, even in short utterances.

Conventional acoustic features such as mel-frequency cepstral coefficients (MFCCs) are effective in various speech processing applications, but they might not be sufficient for ASV when within-speaker variability is large. For instance, while MFCCs are successful at capturing the overall spectral envelope, they obscure fine vocal structures, which also carry an abundance of speaker-specific information.
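As a point of reference for the discussion above, the way MFCCs summarize the spectral envelope can be sketched with a minimal NumPy implementation. This is an illustrative sketch, not the paper's front-end: the frame size, hop, filterbank size, and coefficient count below are common textbook defaults, assumed here for demonstration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced uniformly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def dct2(x, n_ceps):
    # DCT-II along the last axis, keeping the first n_ceps coefficients.
    N = x.shape[-1]
    n = np.arange(N)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * N))
    return x @ basis.T

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_ceps=13):
    # Frame the signal, window, and compute the per-frame power spectrum.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Mel-warp, log-compress, and decorrelate with a DCT: the low-order
    # coefficients summarize the smooth spectral envelope.
    log_mel = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    return dct2(log_mel, n_ceps)
```

Because only the low-order DCT coefficients are retained, fine harmonic structure in the log mel spectrum is discarded, which is precisely the "obscuring" of fine vocal structure noted above.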
Because the spectral envelope varies with phonetic content, long speech segments with rich phonetic content perform well with MFCCs, but shorter segments usually lack the necessary variety of content. Researchers have therefore sought alternative features. For example, voice source-related features have been found to improve speaker recognition systems by providing information that complements conventional cepstral features [4, 5, 6]. Das et al. also reported that features extracted from the voice source signal outperform MFCCs in ASV with test utterances shorter than 3 seconds [7].

In this study, various voice quality features are investigated. Voice quality can be thought of as the “timbre of the voice”. Although it is often associated with voice source characteristics, vocal tract characteristics are also reflected. Laver, in his pivotal study, defined voice quality as the characteristic auditory coloring of an individual speaker’s voice, encompassing both laryngeal and supra-laryngeal features [8]. Voice quality has recently gained momentum in the speaker recognition community because humans utilize voice quality to recognize speakers [9, 10]. Even though machines outperform humans in some long-utterance tasks [3, 11], human listener performance does not degrade much when the phonetic content and utterance lengths are limited [12]. These findings suggest that voice quality might provide important information for short-utterance, text-independent ASV.

In previous work, we showed that a voice quality feature set inspired by a psychoacoustic model can predict human speaker perception and improve ASV performance by providing information complementary to MFCCs [13, 14]. In the present study, we extend that work by analyzing two types of within-speaker variability: phonetic-content and speaking-style variability.
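Of the voice quality features listed in the abstract, cepstral peak prominence (CPP) is representative of the source-related measures. A simplified per-frame computation might look as follows; this is a sketch under a common definition of CPP (peak height above a regression line fit to the cepstrum), not the extraction procedure used in the paper, and the F0 search range is an assumed parameter.

```python
import numpy as np

def cepstral_peak_prominence(frame, sr=16000, f0_min=60.0, f0_max=300.0):
    """Simplified per-frame CPP: the height of the cepstral peak in the
    expected-F0 quefrency range above a linear regression line fit over
    that range. Parameter values are illustrative assumptions."""
    windowed = frame * np.hanning(len(frame))
    log_spectrum = np.log(np.abs(np.fft.rfft(windowed)) + 1e-10)
    cepstrum = np.fft.irfft(log_spectrum)
    # Quefrency indices corresponding to plausible F0 periods.
    lo, hi = int(sr / f0_max), int(sr / f0_min)
    q = np.arange(lo, hi)
    peak = lo + np.argmax(cepstrum[lo:hi])
    slope, intercept = np.polyfit(q, cepstrum[lo:hi], 1)
    return cepstrum[peak] - (slope * peak + intercept)
```

A strongly periodic frame produces a pronounced cepstral peak at the quefrency of the glottal period, so modal (non-breathy) phonation tends to yield higher CPP than noisy or aperiodic frames.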
Specifically, we address the following questions: 1) Which voice quality features are able to separate speakers when there is large within-speaker variability? 2) How much does the performance of a state-of-the-art ASV system degrade from content/style variability when the utterances are short, and how much can voice quality features help in such cases? 3) Would voice quality features be useful for general short-utterance ASV tasks?

Copyright 2017 ISCA
INTERSPEECH 2017, August 20–24, 2017, Stockholm, Sweden
http://dx.doi.org/10.21437/Interspeech.2017-157
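The score fusion and EER evaluation referred to in the abstract can be sketched in a few lines. This is a minimal illustration only: the z-normalization step, the fusion weight of 0.7, and the threshold-sweep EER estimate are assumptions for demonstration, not the paper's tuned protocol.

```python
import numpy as np

def zscore(scores):
    # Normalize each subsystem's scores so they are on a comparable scale.
    s = np.asarray(scores, dtype=float)
    return (s - s.mean()) / (s.std() + 1e-10)

def fuse_scores(mfcc_scores, vq_scores, w=0.7):
    # Linear score-level fusion of two ASV subsystems; w weights the
    # MFCC system. The value 0.7 is an illustrative choice.
    return w * zscore(mfcc_scores) + (1.0 - w) * zscore(vq_scores)

def equal_error_rate(target_scores, nontarget_scores):
    # Sweep candidate thresholds; the EER is taken where the
    # false-acceptance and false-rejection rates are closest.
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    best_gap, eer = np.inf, 1.0
    for t in np.sort(np.concatenate([target_scores, nontarget_scores])):
        far = np.mean(nontarget_scores >= t)
        frr = np.mean(target_scores < t)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer
```

In a fusion experiment, each trial is scored by both subsystems, the two score lists are combined with `fuse_scores`, and the EER of the fused scores is compared against the MFCC-only baseline.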