THE CONTRIBUTION OF CEPSTRAL AND STYLISTIC FEATURES TO SRI'S 2005 NIST SPEAKER RECOGNITION EVALUATION SYSTEM

Luciana Ferrer 1,2   Elizabeth Shriberg 1,3   Sachin S. Kajarekar 1   Andreas Stolcke 1,3   Kemal Sönmez 1   Anand Venkataraman 1   Harry Bratt 1

1 SRI International, Menlo Park, CA, USA
2 Department of Electrical Engineering, Stanford University, Stanford, CA, USA
3 International Computer Science Institute, Berkeley, CA, USA

ABSTRACT

Recent work in speaker recognition has demonstrated the advantage of modeling stylistic features in addition to traditional cepstral features, but to date there has been little study of the relative contributions of these different feature types to a state-of-the-art system. In this paper we provide such an analysis, based on SRI's submission to the NIST 2005 Speaker Recognition Evaluation. The system consists of 7 subsystems (3 cepstral, 4 stylistic). By running independent N-way subsystem combinations for increasing values of N, we find that (1) a monotonic pattern in the choice of the best N systems allows for the inference of subsystem importance; (2) the ordering of subsystems alternates between cepstral and stylistic; (3) syllable-based prosodic features are the strongest stylistic features; and (4) overall subsystem ordering depends crucially on the amount of training data (1 versus 8 conversation sides). Improvements over the baseline cepstral system, when all systems are combined, range from 47% to 67%, with larger improvements for the 8-side condition. These results provide direct evidence of the complementary contributions of cepstral and stylistic features to speaker discrimination.

1. INTRODUCTION

Automatic speaker recognition is the task of identifying a speaker based on his or her voice. Conventional systems for this task use features extracted from very short time segments of speech, and model spectral information using Gaussian mixture models (GMMs) [1].
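The conventional approach just described can be made concrete with a short sketch. The following is a minimal numpy illustration (ours, not the paper's implementation) of GMM-based detection scoring in the common GMM-UBM style: an utterance's frame-level cepstral features are scored by the log-likelihood ratio between a target-speaker GMM and a background GMM. The diagonal-covariance parameterization and all function names are our own illustrative choices:

```python
import numpy as np

def gmm_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood of frames (T, D) under a
    diagonal-covariance GMM with M components."""
    # Per-frame, per-component Gaussian log-densities, summed over dimensions
    diff = frames[:, None, :] - means[None, :, :]                  # (T, M, D)
    ll = -0.5 * (np.log(2 * np.pi * variances)[None] + diff**2 / variances[None])
    comp = ll.sum(axis=2) + np.log(weights)[None]                  # (T, M)
    # Log-sum-exp over components, then average over frames
    return np.logaddexp.reduce(comp, axis=1).mean()

def llr_score(frames, speaker_gmm, ubm):
    """Detection score: log-likelihood ratio of target model vs. background."""
    return gmm_loglik(frames, *speaker_gmm) - gmm_loglik(frames, *ubm)
```

With toy one-component models, frames drawn near the target model's mean yield a positive score and frames near the background mean a negative one, which is the decision statistic a threshold is applied to.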
This approach, while successful in matched acoustic conditions, suffers significant performance degradation in the presence of handset mismatch or ambient noise. Furthermore, since spectral information is not modeled as a sequence, short-term cepstral modeling fails to capture longer-range stylistic aspects of a person's speaking behavior, such as lexical, rhythmic, and intonational patterns. Recently, it has been shown that systems based on longer-range stylistic features provide significant complementary information to the conventional system [2, 3]. In addition, modeling of spectral information by GMMs can be improved or complemented by the use of other modeling techniques, such as support vector machines (SVMs) [4, 5], or by transformations of the cepstral space [6].

The National Institute of Standards and Technology (NIST) conducts annual speaker recognition evaluations (SREs) to allow for meaningful comparisons of different approaches and to assess their performance relative to state-of-the-art systems. In this paper, we describe SRI's submission to the 2005 SRE. The system uses a number of novel long-range features, as well as new approaches to short-term cepstral modeling, and achieved outstanding results in the evaluation.

The main focus of this paper, besides describing the submitted system, is an analysis of the relative importance of the cepstral and stylistic subsystems we have developed. Such an analysis is essential for understanding the source of the performance improvements over the baseline cepstral GMM system and for guiding future research.

The remainder of the paper is organized as follows. Section 2 briefly describes the evaluation setup, the development datasets, and the speech recognition system used. Sections 3 and 4 summarize the subsystems included in our submission and the methods used to combine them. Section 5 presents results and an analysis of subsystem contributions. Final conclusions are given in Section 6.

2. BASIC SETUP

The 2005 NIST SRE dataset (referred to as SRE05) is part of the conversational speech data recorded in the Mixer project. The data contains mostly English speech and was recorded over telephone (landline and cellular) channels. The evaluation consists of twenty main conditions differing in the amount of available training and test data and in the recording conditions [7]. The core condition, for which all evaluation participants are required to submit results, allows one side of a telephone conversation for training and another side for testing. The common condition is defined as the subset of trials, for any of the main conditions, in which all training and test conversations were spoken in English over handheld phones. We submitted results for the (1-side train, 1-side test) and (8-side train, 1-side test) conditions. The common condition subsets for these conditions consisted of 20,907 and 15,947 trials, respectively.

The main performance metric in the NIST SRE is the detection cost function (DCF), defined as the Bayes risk with P_target = 0.01, C_fa = 1, and C_miss = 10 [7]. In this paper, results are presented in terms of the minimum value of the DCF measure over all possible score thresholds and the equal error rate (EER), for the trials corresponding to the common condition.

The component subsystems and combiners described in this paper were developed using three different data sets:
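The DCF and EER defined above can be computed from a set of target and nontarget trial scores by sweeping all possible decision thresholds. The following minimal numpy sketch (our illustration, not NIST's scoring tool) shows one such computation, using the evaluation's cost parameters; function and variable names are our own:

```python
import numpy as np

# NIST SRE cost parameters stated in the evaluation plan
P_TARGET, C_MISS, C_FA = 0.01, 10, 1

def min_dcf_and_eer(target_scores, nontarget_scores):
    """Sweep all score thresholds; return (minimum DCF, EER).
    target_scores: true-speaker trials; nontarget_scores: impostor trials."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    labels = labels[np.argsort(scores)]
    n_tgt, n_non = len(target_scores), len(nontarget_scores)
    # Accept trials with score >= threshold: raising the threshold past each
    # sorted score accumulates misses (rejected targets) and sheds false alarms.
    p_miss = np.concatenate([[0.0], np.cumsum(labels) / n_tgt])
    p_fa = np.concatenate([[1.0], 1 - np.cumsum(1 - labels) / n_non])
    # Bayes risk at each threshold
    dcf = C_MISS * p_miss * P_TARGET + C_FA * p_fa * (1 - P_TARGET)
    # EER: operating point where miss and false-alarm rates are closest
    i = np.argmin(np.abs(p_miss - p_fa))
    return dcf.min(), (p_miss[i] + p_fa[i]) / 2
```

For perfectly separated score distributions both the minimum DCF and the EER are zero; real systems trade the two error types along the threshold sweep.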