Acoustic analysis and feature transformation from neutral to whisper for speaker identification within whispered speech audio streams

Xing Fan, John H.L. Hansen

Center for Robust Speech Systems (CRSS), Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas, Richardson, TX, USA

Received 8 August 2011; received in revised form 6 July 2012; accepted 24 July 2012; available online 25 August 2012

Abstract

Whispered speech is an alternative speech production mode to neutral speech, used intentionally by talkers in natural conversational scenarios to protect privacy and to keep certain content from being overheard or made public. Due to the profound differences between whispered and neutral speech in vocal excitation and vocal tract function, the performance of automatic speaker identification (ID) systems trained on neutral speech degrades significantly. To better understand these differences and to develop efficient model adaptation and feature compensation methods, this study first analyzes the speaker and phoneme dependency of these differences via a maximum likelihood transformation estimated from neutral speech towards whispered speech. Based on the analysis results, this study then considers a feature transformation method in the training phase that yields a more robust speaker model for speaker ID on whispered speech without using whispered adaptation data from test speakers. Three estimation methods that model the transformation from neutral to whispered speech are applied: convolutional transformation (ConvTran), constrained maximum likelihood linear regression (CMLLR), and factor analysis (FA). A speech mode independent (SMI) universal background model (UBM) is trained using collected real neutral features together with transformed pseudo-whisper features generated with the estimated transformation.
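The pseudo-whisper training idea summarized above can be sketched as follows. This is a minimal illustration only: the study estimates the neutral-to-whisper transformation by maximum likelihood (e.g., CMLLR or ConvTran), whereas this toy uses a plain least-squares fit of the same affine form on synthetic 13-dimensional features; all data and dimensions here are hypothetical.

```python
import numpy as np

def fit_affine_transform(neutral, whisper):
    """Least-squares estimate of an affine map y ~ A x + b.

    Simplified stand-in for the maximum likelihood estimation used in
    the paper; only the affine form of the transform is shared.
    """
    n, d = neutral.shape
    X = np.hstack([neutral, np.ones((n, 1))])        # append bias column
    W, *_ = np.linalg.lstsq(X, whisper, rcond=None)  # shape (d + 1, d)
    A, b = W[:d].T, W[d]
    return A, b

def pseudo_whisper(neutral, A, b):
    """Map real neutral features to pseudo-whisper features."""
    return neutral @ A.T + b

# Toy example with synthetic "MFCC-like" frames (illustration only).
rng = np.random.default_rng(0)
neutral = rng.standard_normal((500, 13))
true_A = np.eye(13) * 0.8
true_b = 0.1 * rng.standard_normal(13)
whisper = neutral @ true_A.T + true_b + 0.01 * rng.standard_normal((500, 13))

A, b = fit_affine_transform(neutral, whisper)
pw = pseudo_whisper(neutral, A, b)

# Pool real neutral and pseudo-whisper frames; a speech mode independent
# (SMI) UBM would then be trained on this pooled feature set.
smi_training_data = np.vstack([neutral, pw])
print(smi_training_data.shape)  # (1000, 13)
```

In the paper's framework the pooled features would feed standard GMM-UBM training; the point of the sketch is only that pseudo-whisper data is synthesized from neutral data, so no whispered speech is needed from test speakers.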
Text-independent closed-set speaker ID results on the UT-VocalEffort II corpus show performance improvement with the proposed training framework. The best performance of 88.87% is achieved using the ConvTran model, which represents a relative improvement of 46.26% over the 79.29% accuracy of the GMM-UBM baseline system. This result suggests that synthesizing pseudo-whispered speaker and background training data with the ConvTran model improves speaker ID robustness to whispered speech.

© 2012 Elsevier B.V. All rights reserved.

Keywords: Speaker identification; Whispered speech; Vocal effort; Robust speaker verification

This project was funded by AFRL through a subcontract to RADC Inc. under FA8750-09-C-0067 (approved for public release, distribution unlimited), and partially by the University of Texas at Dallas from the Distinguished University Chair in Telecommunications Engineering held by J. Hansen.

Corresponding author. Address: Center for Robust Speech Systems (CRSS), Erik Jonsson School of Engineering and Computer Science, Dept. of Electrical Engineering, University of Texas at Dallas, 2601 N. Floyd Road, EC33, Richardson, TX 75080-1407, USA. Tel.: +1 972 883 2910; fax: +1 972 883 2710. E-mail address: John.Hansen@utdallas.edu (J.H.L. Hansen). URL: http://crss.utdallas.edu

Speech Communication 55 (2013) 119–134
doi:10.1016/j.specom.2012.07.002

1. Introduction

Whispered speech is a natural speech production mode, employed in public situations to protect privacy and to keep certain content from being made public. For example, a customer might whisper to provide information such as a date of birth, credit card details, or billing address when making hotel, flight, or car reservations through a machine interface over the telephone, and a doctor might whisper when entering a voice memo discussing patient medical records in public. Aphonic individuals, as well as those with low vocal