A hierarchical duration model for speech recognition based on the ANGIE framework 1

Grace Y. Chung *, Stephanie Seneff

Spoken Language Systems Group, Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

Received 8 May 1998; received in revised form 3 November 1998; accepted 6 November 1998

Abstract

This paper presents a hierarchical duration model applied to enhance speech recognition. The model is based on the novel ANGIE framework which is a flexible unified sublexical representation designed for speech applications. This duration model captures duration phenomena operating at the phonological, phonemic, syllabic and morphological levels. At the core of the modelling scheme is a hierarchical normalization procedure performed on the ANGIE parse structure. From this, we derive a robust measure for the rate of speech. The model uses two sets of statistical models – a first set based on relative duration between sublexical units and a second set based on absolute duration that has been normalized with respect to the speaking rate. We have used this paradigm to explore some speech timing phenomena such as the secondary effects on relative duration due to variations in speaking rate, the characteristics of anomalously slow words, and prepausal lengthening effects. Finally, we successfully demonstrate the utility of durational information for recognition applications. In phonetic recognition, we achieve a relative improvement of up to 7.7% by incorporating our model over and above a standard phone duration model, and similarly, in a word spotting task, an improvement from 89.3 to 91.6 (FOM) has resulted. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Duration modelling; Prosodic modelling; Speech recognition

1. Introduction

Durational patterns of phonetic segments and pauses convey information about the linguistic content of an utterance.
Listeners make linguistic decisions on the basis of durational cues which can serve to distinguish, for example, between inherently long versus short vowels, voiced versus unvoiced fricatives, phrase-final versus non-final syllables and stressed versus unstressed vowels. Duration is also used to detect the presence or absence of emphasis. Given that such durational information is of perceptual importance to the human listener, it follows that durational information may be extracted for improving speech recognition performance. It has also been observed that recognition error rates are higher for particularly fast speakers (Pallet et al., 1995) and consequently the ability to handle such variations could boost recognition performance. However, our current understanding of durational patterns and the many sources of variability which affect them is still sparse.

Speech Communication 27 (1999) 113–134
* Corresponding author. Tel.: +1 617 253 3043; fax: +1 617 258 8642; e-mail: graceyc@mit.edu
1 This research was supported by a contract from Bell Atlantic and by the National Science Foundation under Grant No. IRI-9618731.
0167-6393/99/$ – see front matter © 1999 Elsevier Science B.V. All rights reserved.
PII: S0167-6393(98)00071-5

To