Explicit Duration Modelling in HMM/ANN Hybrids

László Tóth and András Kocsor
Research Group on Artificial Intelligence
H-6720 Szeged, Aradi vértanúk tere 1., Hungary
{tothl,kocsor}@inf.u-szeged.hu

Abstract. In some languages, such as Finnish or Hungarian, phone duration is a very important distinctive acoustic cue. The conventional HMM speech recognition framework, however, is known to model duration information poorly. In this paper we compare different duration models within the framework of HMM/ANN hybrids. The tests are performed with two different hybrid models: the conventional one and the recently proposed "averaging hybrid". Independently of the model configuration, we report that the usual exponential duration model has no detectable advantage over using no duration model at all. Similarly, applying the same fixed value for all state transition probabilities, as is usual with HMM/ANN systems, is found to have no influence on the performance. However, the practical trick of imposing a minimum duration on the phones turns out to be very useful. The key part of the paper is the introduction of the gamma distribution duration model, which proves clearly superior to the exponential one, yielding a 12-20% relative improvement in the word error rate and thus justifying the use of sophisticated duration models in speech recognition.

1 Introduction

In some languages, such as Finnish or Hungarian, phone durations may be the only cue for discriminating certain words. Good duration modelling can therefore be an important issue. The conventional HMM speech recognition framework, however, makes little use of duration information. Though the state transition probabilities can be regarded as a geometric duration model, this model is not very effective. First, the geometric distribution is a very poor approximation of real phone durations.
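The first point can be made concrete with a minimal Python sketch (an illustration under stated assumptions, not the authors' implementation). A single HMM state with self-loop probability a is occupied for exactly d frames with probability P(d) = a^(d-1)(1 - a). This is largest at d = 1 and decays monotonically, whereas real phone durations rise to a peak at some positive duration. A gamma density with shape k > 1 has this rise-and-fall shape. The function names, the method-of-moments fit and the sample durations below are hypothetical choices made for illustration; the paper's own estimation procedure is not described in this excerpt.

```python
import math


def geometric_duration_pmf(d: int, a: float) -> float:
    """Probability of staying exactly d frames (d >= 1) in one HMM state
    whose self-loop (transition-to-itself) probability is a."""
    if d < 1:
        return 0.0
    return a ** (d - 1) * (1.0 - a)


def geometric_self_loop_from_mean(mean: float) -> float:
    """Self-loop probability giving the requested mean duration in frames.
    The geometric distribution on {1, 2, ...} has mean 1 / (1 - a)."""
    if mean <= 1.0:
        raise ValueError("mean duration must exceed one frame")
    return 1.0 - 1.0 / mean


def gamma_pdf(x: float, k: float, theta: float) -> float:
    """Gamma density with shape k and scale theta, computed in the log
    domain to avoid overflow for large shape values."""
    if x <= 0:
        return 0.0
    log_p = (k - 1) * math.log(x) - x / theta - math.lgamma(k) - k * math.log(theta)
    return math.exp(log_p)


def fit_gamma_moments(durations):
    """Hypothetical method-of-moments fit: k = mean^2 / var, theta = var / mean."""
    n = len(durations)
    mean = sum(durations) / n
    var = sum((x - mean) ** 2 for x in durations) / n
    return mean * mean / var, var / mean


if __name__ == "__main__":
    durations = [6, 8, 9, 10, 10, 11, 12, 14]  # hypothetical phone durations (frames)
    k, theta = fit_gamma_moments(durations)
    a = geometric_self_loop_from_mean(sum(durations) / len(durations))
    print(f"gamma: k={k:.2f}, theta={theta:.3f}; geometric: a={a:.2f}")
    for d in (1, 5, 10, 15, 20):
        # Compare the probability each model assigns to a d-frame phone.
        print(d, round(geometric_duration_pmf(d, a), 4), round(gamma_pdf(d, k, theta), 4))
```

With these hypothetical durations both models share a mean of 10 frames, yet the geometric model assigns its highest probability to a 1-frame phone, while the fitted gamma density peaks near (k - 1)θ ≈ 9.5 frames.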
Second, several authors have reported that the state transition values have practically no influence on the recognition scores [2].

In this paper we examine the issue of duration modelling within the framework of HMM/ANN hybrids. Two types of hybrid models will be tested: the conventional one known from the literature, and a novel one proposed recently. In both cases we seek to answer two questions. First, we want to either confirm or refute the common view that the geometric duration model is wholly ineffective. Second, we would like to know whether replacing the geometric model with a more sophisticated gamma distribution model can improve the performance of the two hybrids.

2 A Segment-Based View of HMM/ANN Hybrids

This paper deals with the kind of HMM models in which the usual Gaussian mixture components are replaced by artificial neural network (ANN) estimates. We will refer to

V. Matoušek et al. (Eds.): TSD 2005, LNAI 3658, pp. 310–317, 2005.
© Springer-Verlag Berlin Heidelberg 2005