MODELLING THE PREPAUSAL LENGTHENING EFFECT FOR SPEECH RECOGNITION: A DYNAMIC BAYESIAN NETWORK APPROACH

Ning Ma 1*, Chris D. Bartels 2, Jeff A. Bilmes 2 and Phil D. Green 1

1 Department of Computer Science, University of Sheffield, Sheffield S1 4DP, UK
2 Department of Electrical Engineering, University of Washington, Seattle, WA 98195

{n.ma,p.green}@dcs.shef.ac.uk, {bartels,bilmes}@ee.washington.edu

ABSTRACT

In speech, the unit preceding a pause tends to lengthen. This work presents the use of a dynamic Bayesian network to model this prepausal lengthening effect for robust speech recognition. Specifically, we introduce two distributions to model inter-state transitions, one for prepausal words and one for non-prepausal words. The selection of the transition distribution depends on a random variable whose value is influenced by whether a pause will appear between the current and the following word. Two experiments are presented. The first considers pauses hypothesised during speech decoding. The second employs an extra component for speech/non-speech determination. By modelling the prepausal lengthening effect we achieve a 5.5% relative reduction in word error rate on the 500-word task of the SVitchboard corpus.

Index Terms: Prepausal lengthening, duration, prosody, robust speech recognition, dynamic Bayesian networks

1. INTRODUCTION

Automatic speech recognition (ASR) employing segmental features (e.g. MFCC) has achieved great success, but performance often degrades dramatically in the presence of noise. One reason is that most ASR systems do not explicitly represent prosodic properties such as duration. Modelling the interaction of these properties with words is important, as prosodic properties can be relatively insensitive to moderate noise and channel distortions [1].
Their resistance to noise also means that prosodic analysis of the training data remains valid for ASR in conditions that are not known to match the training condition. In this study we propose to model one prosodic property: the prepausal lengthening effect on word durations.

The prepausal lengthening effect is the property that, before a speech pause, the preceding speech unit tends to lengthen. The nature and effects of this property have been well studied in [2, 3, 4] through a series of experiments analysing segmental durations in continuous speech. These studies have given evidence that the syntactic pause is one of the primary factors that influence vowel durations for an individual speaker. The lengthening property is thought to be correlated with high-level linguistic structures such as sentence boundaries, syntax and semantics, but it can also be observed in connected digits, where most linguistic cues are minimised [5].

Since this duration property occurs in speech units such as phones, syllables and words, most research has focused on the use of phone-/word-level models for ASR. In [6] ASR improvements were reported by penalising word hypotheses that are inconsistent with prosodic duration. This idea was extended in [7] and [1], where explicit word-duration models were estimated and employed to re-score word hypotheses in N-best lists. To model prepausal lengthening, separate duration models for words preceding a pause were employed, which significantly reduced word errors. ASR improvements were also reported in [8, 5] by employing separate duration models for sentence-final words. Prepausal lengthening was also investigated in [9] within a hierarchical duration model framework, although the property was not explicitly modelled.

* The first author performed the work while visiting the University of Washington, Seattle.

This paper proposes the use of a dynamic Bayesian network (DBN) to model the prepausal lengthening effect for speech recognition.
Specifically, we introduce two state transition matrices, one for prepausal words and one for non-prepausal words. The selection of the transition matrix depends on a random variable whose value is influenced by whether a pause will appear between the current and the following word. This study uses the 500-word task of the SVitchboard corpus [10], a small closed-vocabulary subset of Switchboard I [11]. In Section 2 we explore the prepausal lengthening effect further using this corpus. Section 3 presents techniques to incorporate this property into ASR. Experiments and results are described in Section 4. Section 5 concludes and presents future directions.

2. PREPAUSAL LENGTHENING IN SVITCHBOARD

[Figure 1: spectrograms of two utterances, horizontal axis Time [sec]. (a) "[sil] you know different ways a family [sil]"; (b) "[sil] and we didn’t know [sil]".]

Fig. 1. An example from the SVitchboard corpus illustrating the prepausal lengthening effect. The transcription is shown above the spectrogram of each audio signal, with segmentation indicated by dashed lines. The word know lasts 141 ms in (a) and 436 ms in (b), where it precedes a speech pause ([sil]).
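
As a rough illustration of how two transition matrices can encode the kind of duration difference seen in Fig. 1, the Python sketch below builds one left-to-right matrix per value of a binary prepausal variable and compares the expected word durations they imply. It is not the DBN implementation used in this work: the three-state topology, the self-loop probabilities (0.5 and 0.8), the 10 ms frame shift and all function names are assumptions chosen for illustration only.

```python
# Illustrative sketch only: the topology, probabilities and frame shift
# below are hypothetical and are not taken from the paper.

FRAME_SHIFT_MS = 10.0  # assumed frame shift


def left_to_right(self_loops):
    """Return a left-to-right transition matrix as a list of rows.

    State i stays put with probability self_loops[i] and otherwise moves
    to state i + 1; the extra last state is an absorbing word-exit state.
    """
    n = len(self_loops)
    matrix = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i, p in enumerate(self_loops):
        matrix[i][i] = p
        matrix[i][i + 1] = 1.0 - p
    matrix[n][n] = 1.0
    return matrix


# One transition matrix per value of the binary prepausal variable.
TRANSITIONS = {
    False: left_to_right([0.5, 0.5, 0.5]),  # non-prepausal word
    True: left_to_right([0.8, 0.8, 0.8]),   # prepausal word: stickier states
}


def select_transitions(prepausal):
    """Pick the transition matrix according to the prepausal variable."""
    return TRANSITIONS[bool(prepausal)]


def expected_duration_ms(matrix):
    """Expected word duration implied by a left-to-right matrix.

    An emitting state with self-loop probability a is occupied for
    1 / (1 - a) frames on average (geometric duration).
    """
    emitting = range(len(matrix) - 1)  # skip the absorbing exit state
    frames = sum(1.0 / (1.0 - matrix[i][i]) for i in emitting)
    return frames * FRAME_SHIFT_MS


if __name__ == "__main__":
    for prepausal in (False, True):
        ms = expected_duration_ms(select_transitions(prepausal))
        print(f"prepausal={prepausal}: expected duration {ms:.0f} ms")
```

With larger self-loop probabilities, the prepausal matrix implies a longer expected duration for the same word, which is the qualitative behaviour that switching between two transition matrices is meant to capture.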