MODELING PROSODIC DYNAMICS FOR SPEAKER RECOGNITION

Andre G. Adami 1, Radu Mihaescu 2, Douglas A. Reynolds 3, John J. Godfrey 4

1 OGI School of Science and Engineering, Oregon Health and Science University, 2 Princeton University, 3 MIT Lincoln Laboratory, 4 U.S. DoD

adami@ece.ogi.edu, mihaescu@princeton.edu, dar@ll.mit.edu, godfrey@afterlife.ncsc.mil

ABSTRACT

Most current state-of-the-art automatic speaker recognition systems extract speaker-dependent features by looking at short-term spectral information. This approach ignores long-term information that can convey supra-segmental information, such as prosodics and speaking style. We propose two approaches that use the fundamental frequency and energy trajectories to capture long-term information. The first approach uses bigram models to model the dynamics of the fundamental frequency and energy trajectories for each speaker. The second approach uses the fundamental frequency trajectories of a pre-defined set of words as the speaker templates and then, using dynamic time warping, computes the distance between the templates and the words from the test message. The results presented in this work are on Switchboard I using the NIST Extended Data evaluation design. We show that these approaches can achieve an equal error rate of 3.7%, which is a 77% relative improvement over a system based on short-term pitch and energy features alone.

1. INTRODUCTION

Current speaker recognition systems are based primarily on modeling the distributions of short-term spectral features [1]. While these systems produce very good performance, they ignore many other aspects of the speech signal that convey speaker information, such as prosodic information from pitch and energy contours. However, it is clear from results in several published studies (e.g., [2, 3, 4, 5] and their references) that prosodic information can be used to effectively improve the performance of, and add robustness to, speaker recognition systems.
Prosodic information has been applied in two main ways. In the first approach, global statistics of some prosodic-based feature are estimated and compared between two utterances. The most common example is comparing the mean and standard deviation of the fundamental frequency between enrollment and test utterances [3]. Alternatively, the prosodic feature may be appended to standard spectral-based features and used in traditional distribution modeling systems. One potential problem with this global statistics approach is that it does not adequately capture the temporal dynamics of the prosodic feature sequence. This has been addressed in part by using statistics of feature time derivatives and dynamic features derived from segments [2].

The second approach aims to explicitly represent and compare the temporal trajectories of the prosodic contours. The classic example of this approach is applying dynamic time warping (DTW) to compare the pitch contours of two utterances of the same text [6]. This approach has the advantage of potentially capturing idiosyncratic speaker-specific temporal dynamic events, but it generally requires comparison of the same spoken text to be effective. Because the spoken text cannot be controlled, text-independent applications have generally been limited to global statistical approaches.

In this paper, we present two new approaches that demonstrate effective ways to model and apply prosodic contours for text-independent speaker verification tasks. The first approach uses the relation between the dynamics of the fundamental frequency (f0) and energy trajectories to characterize the speaker's identity. The motivation is that the dynamics of the two trajectories can jointly represent certain prosodic gestures that are characteristic of a particular speaker. In addition, the dynamics can also capture the speaking style (for example, excited or monotone) of the speaker.
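To make the first approach concrete, the sketch below quantizes frame-to-frame f0 and energy slopes into rising/flat/falling symbols, pairs them into joint dynamic symbols, and trains an add-one-smoothed bigram model per speaker. This is a minimal illustration under assumed choices (a three-way slope quantizer, hypothetical slope thresholds, and add-alpha smoothing), not the exact symbol inventory or recipe used in our system.

```python
import math
from collections import Counter

def quantize(track, eps=1.0):
    """Map a contour to slope symbols: rising '+', flat '0', falling '-'.
    The threshold eps is an illustrative assumption."""
    syms = []
    for prev, cur in zip(track, track[1:]):
        d = cur - prev
        syms.append('+' if d > eps else '-' if d < -eps else '0')
    return syms

def joint_symbols(f0, energy, eps_f0=1.0, eps_en=0.5):
    """Pair f0 and energy slope classes into joint dynamic symbols."""
    return list(zip(quantize(f0, eps_f0), quantize(energy, eps_en)))

def train_bigram(symbols, alpha=1.0):
    """Train an add-alpha smoothed bigram model over joint symbols;
    returns a log-probability function for (previous, current) pairs."""
    bigrams = Counter(zip(symbols, symbols[1:]))
    unigrams = Counter(symbols[:-1])
    vocab_size = 9  # 3 f0 classes x 3 energy classes
    def logprob(prev, cur):
        return math.log((bigrams[(prev, cur)] + alpha) /
                        (unigrams[prev] + alpha * vocab_size))
    return logprob

def score(model, symbols):
    """Average log-likelihood of a test symbol sequence under a speaker model."""
    pairs = list(zip(symbols, symbols[1:]))
    return sum(model(p, c) for p, c in pairs) / max(len(pairs), 1)
```

In use, one such model would be trained per enrolled speaker, and a test utterance scored against each; higher average log-likelihood indicates a better match to that speaker's joint f0/energy dynamics.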
The second approach capitalizes on the increasing accuracy of speech recognition systems on conversational speech to allow explicit template matching of the f0 contours of a predefined set of words and phrases. The motivation is to capture speaker-characteristic accent and intonation information from a known set of frequently and naturally occurring words found in conversational speech. For the remainder of the paper, we refer to both approaches as prosodic systems.

This paper is organized as follows. In Section 2, we describe the NIST Extended Data Task and the prosodic feature database used in this paper. We then describe systems and performance for a baseline system using simple f0 and energy distributions, followed by descriptions of the new approaches using f0 and energy contour dynamics and text-constrained f0 contour matching. In Section 6, we present fusion results demonstrating that these new approaches contribute complementary and beneficial information to the speaker recognition task.

2. NIST EXTENDED DATA TASK

The work presented in this paper was developed as part of the SuperSID project [7] at the 2002 JHU Summer Workshop. For this project, the development focus was on the Extended Data Task from the 2001 NIST Speaker Recognition Evaluation [i]. This task was introduced to allow exploration and development of techniques that can exploit significantly more training data than traditionally used in NIST evaluations. For this task, speaker models were trained using 1, 2, 4, 8, and 16 complete conversation halves (where a conversation half is nominally 2.5 minutes long), as opposed to only 2 minutes of training speech. A complete conversation half was used for testing. The 2001 Extended Data Task used the entire Switchboard I conversational telephone speech corpus in a cross-validation procedure to obtain a large

[i] The 2001 NIST Speaker Recognition Evaluation website: http://www.nist.gov/speech/tests/spk/2001
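The contour comparison underlying the second approach, described in Section 1, can be sketched as the classic DTW dynamic program between two word-level f0 contours. This is a minimal illustration with unit local steps and a simple length normalization; the local path constraints, distance measure, and normalization used in the actual system may differ.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two f0 contours
    (sequences of per-frame f0 values for the same word).
    Classic O(len(a) * len(b)) dynamic program with unit local steps."""
    n, m = len(a), len(b)
    INF = float('inf')
    # D[i][j] = minimum accumulated cost aligning a[:i] with b[:j].
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    # Normalize by combined length so scores are comparable across words.
    return D[n][m] / (n + m)
```

A test word's contour would be scored against the enrolled speaker's template for the same word, with smaller warped distances indicating a closer match to that speaker's intonation pattern.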