Segmentation of Speech for Speaker and Language Recognition

André G. Adami 1, Hynek Hermansky 1,2
1 OGI School of Science and Engineering, Oregon Health and Science University, Portland, USA
2 International Computer Science Institute, Berkeley, California, USA
{adami, hynek}@asp.ogi.edu

Abstract

Current automatic speech recognition systems convert the speech signal into a sequence of discrete units, such as phonemes, and then apply statistical methods to those units to produce the linguistic message. A similar methodology has also been applied to recognizing speakers and languages, except that the output of the system is speaker or language information. We therefore propose the use of temporal trajectories of fundamental frequency and short-term energy to segment and label the speech signal into a small set of discrete units that can be used to characterize a speaker and/or language. The proposed approach is evaluated on the NIST Extended Data Speaker Detection task and the NIST Language Identification task.

1. Introduction

Many sources of information besides the linguistic message are imprinted on the speech signal. Automatic speech recognition (ASR) aims at recovering the linguistic message from speech by describing the signal in terms of discrete units such as words. However, one may also attempt to extract other information from speech, such as who is speaking or which language is being spoken. For such applications, the underlying concept of words formed by phonemes (implied in most Large-Vocabulary Connected Speech Recognition (LVCSR) systems) may not be necessary – it is sufficient to consistently convert the continuous acoustic speech signal into a string of discrete labeled units. Along these lines, new approaches to speaker and language recognition, based on simple speaker-specific and/or language-specific models [1-5], have started to emerge. Doddington [1] uses the sequence of words extracted from the speech signal to build statistical models for speaker recognition. Andrews [4] uses the sequence of phones to capture a speaker's pronunciation. Torres-Carrasquillo [5] uses a sequence of tokens obtained from a Gaussian mixture model to model language information.

Our research contributes to this emerging direction. We use information in prosodic cues (temporal trajectories of short-term energy and fundamental frequency – f0), as well as coarse phonetic information (broad-phonetic categories – BFC), to segment and label the speech signal into a relatively small number of classes (i.e., significantly fewer than the context-dependent phonemes of current LVCSR systems). We also demonstrate that such strings of labeled sub-word units can be used for building statistical models that contribute to characterizing speakers and/or languages.

This paper is organized as follows: Section 2 describes techniques for segmentation of the speech signal. In Sections 3 and 4, we describe the NIST Language Identification task and the NIST Extended Data Speaker Recognition task. We then describe the applied systems and demonstrate the performance of the proposed approach in speaker and language identification.

2. Speech Segmentation

Different speakers and different languages may be characterized by different intonation or rhythm patterns, produced by changes in pitch and in sub-glottal pressure, as well as by the different sounds of a language.
Therefore, the combination of pitch, sub-glottal pressure, and duration that characterizes particular prosodic "gestures", together with some additional coarse description of the speech sounds used, should be useful for extracting speaker [2, 6] and language information [7, 8]. Thus, a sequence of discrete units that describes the continuous speech signal in terms of the dynamics of the f0 temporal trajectory (as a proxy for pitch), the dynamics of the short-term energy temporal trajectory (as a proxy for sub-glottal pressure), and possibly also the produced speech sounds, could be used for building models that characterize a given speaker and/or language.

The speech segmentation is divided into five steps (a code sketch of the full procedure is given at the end of this section):
1) compute the f0 and energy temporal trajectories,
2) compute the rate of change of each trajectory,
3) detect the inflection points (points at the zero-crossings of the rate of change) of each trajectory,
4) segment the speech signal at the detected inflection points and at the starts and ends of voicing, and
5) convert the segments into a sequence of symbols using the rate of change of both trajectories within each segment.

Such segmentation is performed over an utterance, i.e., a period of time during which one speaker is speaking.

The rate of change of the f0 and energy temporal trajectories is estimated using their time derivatives. The time derivatives are estimated by fitting a straight line to several consecutive analysis frames (the method often used for the estimation of so-called "delta features" in ASR).

The utterance is segmented at inflection points of the temporal trajectories or at the start or end of voicing. First, we detect the inflection points of each trajectory at the zero-crossings of its derivative, as shown by the filled circles in Figure 1. Second, we segment the utterance using the inflection points from both trajectories together with the starts and ends of voicing. Finally, each segment is assigned to one of a set of classes that describe the joint dynamics of both temporal trajectories. Since there are no f0 values in unvoiced regions, the unvoiced segments constitute a single class. Table 1 lists the five possible classes used to describe the speech segments.

We can also integrate duration information by adding an extra label to each segment class. Since we are using the tokens to build models, the segment classes are further split into "Short" and "Long". For voiced regions, Short is assigned to segments shorter than 8 frames (80 ms). For unvoiced regions, Short is assigned to
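To make steps 1 and 2 concrete, the following is a minimal Python sketch of the rate-of-change estimation, assuming frame-synchronous f0 and energy trajectories (one value per 10-ms analysis frame). It implements the standard linear-regression "delta feature" formula from ASR; the window half-length w = 2 is an illustrative choice, not a value specified here.

import numpy as np

def delta(x, w=2):
    # Least-squares slope of a straight line fitted over a (2*w+1)-frame
    # window: d[t] = sum_k k * (x[t+k] - x[t-k]) / (2 * sum_k k^2).
    x = np.asarray(x, dtype=float)
    pad = np.pad(x, w, mode='edge')   # repeat edge frames at the borders
    denom = 2.0 * sum(k * k for k in range(1, w + 1))
    d = np.zeros_like(x)
    for k in range(1, w + 1):
        d += k * (pad[w + k:len(pad) - w + k] - pad[w - k:len(pad) - w - k])
    return d / denom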
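Building on delta() above, the sketch below carries out steps 3 to 5: it cuts the utterance at zero-crossings of either derivative and at voicing changes, then labels each segment. The class names (F+/F- and E+/E- for rising/falling f0 and energy, UV for unvoiced) reflect our reading of the five-class scheme referenced as Table 1 rather than a reproduction of it, and the Short threshold for unvoiced segments is a placeholder parameter, since only the 8-frame voiced threshold is stated in the text.

def segment_and_label(f0, energy, short_voiced=8, short_unvoiced=12):
    # f0 is assumed to be 0 in unvoiced frames (a common pitch-tracker
    # convention); short_unvoiced = 12 is a placeholder value.
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    df0, d_en = delta(f0), delta(energy)

    def crossings(d):
        # frame indices where the derivative changes sign
        return np.where(np.diff(np.sign(d)) != 0)[0] + 1

    # Boundaries: voicing onsets/offsets plus zero-crossings of either
    # derivative (f0 crossings only count inside voiced regions).
    b = set(np.where(np.diff(voiced.astype(int)) != 0)[0] + 1)
    b |= {c for c in crossings(df0) if voiced[c]}
    b |= set(crossings(d_en))
    bounds = [0] + sorted(c for c in b if 0 < c < len(f0)) + [len(f0)]

    labels = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        if voiced[lo]:
            # four voiced classes: sign combinations of the two slopes
            cls = ('F+' if df0[lo:hi].mean() >= 0 else 'F-') + \
                  ('E+' if d_en[lo:hi].mean() >= 0 else 'E-')
            short = (hi - lo) < short_voiced
        else:
            cls = 'UV'   # all unvoiced segments share one class
            short = (hi - lo) < short_unvoiced
        labels.append(cls + ('.S' if short else '.L'))
    return labels

# Toy usage: 30 voiced frames with rising-then-falling f0, then silence.
f0 = np.concatenate([np.linspace(100, 140, 15), np.linspace(140, 110, 15), np.zeros(10)])
en = np.concatenate([np.linspace(0.2, 1.0, 30), np.full(10, 0.05)])
print(segment_and_label(f0, en))   # prints the token string for this toy utterance

The resulting label strings play the role of the discrete tokens described above: each utterance becomes a short symbol sequence over ten token types (five classes, each Short or Long) on which n-gram or other statistical models can be trained.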