IMPROVED FEATURE DECORRELATION FOR HMM-BASED SPEECH RECOGNITION

Kris Demuynck, Jacques Duchateau, Dirk Van Compernolle * and Patrick Wambacq
K.U.Leuven - ESAT - PSI, Kardinaal Mercierlaan 94, B-3001 Heverlee, Belgium
E-mail: Kris.Demuynck@esat.kuleuven.ac.be

ABSTRACT

In most HMM-based recognition systems, a mixture of diagonal covariance gaussians is used to model the observation density functions in the states. The use of diagonal covariance gaussians, however, assumes that the underlying data vectors have uncorrelated vector components: if each gaussian were replaced with its full covariance counterpart, the off-diagonal elements in the covariance matrices should be small. To that end, most recognition systems apply some kind of decorrelating transform near the end of the preprocessing. Examples are the inverse cosine transform used with cepstral coefficients, and principal component analysis (PCA) or linear discriminant analysis (LDA) of the features. However, none of these transforms is optimal when it comes to reducing the mismatch introduced by setting the off-diagonal elements in the covariance matrices to zero.

The algorithm described in this paper reduces the local correlations between feature vector components inside the gaussians with a single global linear transform at the end of the preprocessing stage. The algorithm is optimal in the sense that we calculate the linear transformation that minimises the sum of the squares of all off-diagonal elements over all gaussians. The algorithm is compared with principal component analysis, linear discriminant analysis and the recently published maximum likelihood modelling for semi-tied covariance matrices. The decorrelation method is also evaluated on two speech recognition tasks. A significant relative improvement was achieved in both cases.

1. INTRODUCTION

In many speech recognition systems, the observation density functions are modelled as mixtures of diagonal covariance gaussians.
These mixtures of gaussians are, however, only approximations of the real distributions. One of the approximations is the assumption that the off-diagonal elements of the covariance matrices of the gaussians are close to zero. To that end, most recognition systems apply some kind of parameter decorrelation near the end of the preprocessing. Examples are the inverse cosine transform used with cepstral coefficients, and principal component analysis (PCA) or linear discriminant analysis (LDA) of the features. However, none of these transforms is designed to minimise the magnitude of the off-diagonal elements in the covariance matrices in an optimal way. The algorithm we propose is optimal in the sense that we calculate, with a least-squares method, the linear transformation that minimises the magnitude of the off-diagonal elements in the covariance matrices over all gaussians.

* Lernout & Hauspie Speech Products.

The remainder of the text is organised as follows. First, the decorrelation algorithm is explained in detail. Next, the algorithm is compared with some existing alternatives. Finally, the method is evaluated on two speech recognition tasks, and some remarks are given.

2. ALGORITHM

As mentioned above, we search for a single linear transformation of the acoustic features that minimises the average of the squares of the off-diagonal elements over a large set of covariance matrices. To compensate for a possible scaling of the axes, the off-diagonal elements are normalised with respect to the diagonal elements. Thus, what is actually minimised is a weighted sum of the squares of the correlation coefficients between the parameters, simultaneously over all gaussians.

Let $\mu^{(m)}$ be the mean and $\Sigma^{(m)}$ the full covariance matrix of gaussian $m$, with $\Sigma^{(m)}_{ij}$ the element on row $i$ and column $j$.
Let $N^{(m)}$ be the number of points assigned to gaussian $m$, $N = \sum_m N^{(m)}$ the total number of points in the training data, and $\lambda^{(m)} = N^{(m)}/N$ the weight of the gaussian. We then have to find a transformation matrix $A$ that minimises the following quantity:

$$\sum_{m} \lambda^{(m)} \sum_{i \neq j} \left( \frac{\tilde{\Sigma}^{(m)}_{ij}}{\sqrt{\tilde{\Sigma}^{(m)}_{ii}\,\tilde{\Sigma}^{(m)}_{jj}}} \right)^{2} \qquad (1)$$

with $\tilde{\Sigma}^{(m)} = A\,\Sigma^{(m)} A^{T}$.

This quantity can be optimised with numerical techniques, e.g. by decomposing the transformation matrix $A$ into a product of basic transformations of the form $(I + \delta_{ij})$, with $I$ the identity matrix and $\delta_{ij}$ a matrix equal to zero except for element $(i, j)$. The optimisation problem is strongly simplified if the normalisation with respect to the variance is omitted. As to limit the
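To make the objective concrete, the following is a minimal NumPy sketch (not code from the paper; the function names and the grid-search step size are our own choices) of the weighted criterion in eq. (1) and of one coordinate step over an elementary transform $(I + \delta_{ij})$, here implemented as a simple grid search over the scalar $\delta$:

```python
import numpy as np

def decorrelation_objective(A, covs, weights):
    """Weighted sum, over all gaussians, of the squared correlation
    coefficients of the transformed covariances S~ = A S A^T
    (the quantity minimised in eq. (1))."""
    total = 0.0
    for S, lam in zip(covs, weights):
        St = A @ S @ A.T
        d = np.sqrt(np.diag(St))          # standard deviations
        R = St / np.outer(d, d)           # correlation matrix
        off = R - np.diag(np.diag(R))     # keep only the i != j terms
        total += lam * np.sum(off ** 2)
    return total

def elementary_step(A, covs, weights, i, j,
                    deltas=np.linspace(-0.5, 0.5, 101)):
    """One step of the suggested decomposition: evaluate
    A' = (I + delta * E_ij) @ A over a grid of delta values, where E_ij
    is zero except for a 1 at position (i, j), and keep the best A'.
    The grid contains delta = 0, so the objective cannot increase."""
    n = A.shape[0]
    E = np.zeros((n, n))
    E[i, j] = 1.0
    candidates = [(np.eye(n) + d * E) @ A for d in deltas]
    return min(candidates,
               key=lambda Ac: decorrelation_objective(Ac, covs, weights))
```

Cycling `elementary_step` over all off-diagonal pairs $(i, j)$ until the objective stops decreasing gives a crude but workable optimiser; in practice a gradient-based routine on the same objective would converge faster.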