N-gram Adaptation Using Dirichlet Class Language Model Based on Part-of-Speech for Speech Recognition

Abstract: The language model plays an important role in automatic speech recognition (ASR) systems. Its performance depends on how well it is adapted to the linguistic features of the target language. Accordingly, adaptation methods attempt to exploit syntactic and semantic characteristics of the language for language modeling. Previous adaptation methods, such as the family of Dirichlet class language models (DCLM), extract classes of history words. Because these methods lack syntactic information, they are not well suited to morphologically rich languages such as Farsi. This work proposes using syntactic information, namely part-of-speech (POS) tags, in DCLM and combining the result with an n-gram language model. In our proposed approach, word clustering is based on the POS tags of the previous words as well as on the history words themselves. The language models are evaluated on the BijanKhan corpus using a hidden Markov model (HMM) based ASR system. Our experiments show that using POS information along with history words and classes of history words improves the language model and decreases perplexity on our corpus. Exploiting POS information together with DCLM decreases the word error rate of the ASR system by 1% compared to DCLM alone.

Keywords: speech recognition, language model adaptation, part-of-speech, perplexity, word error rate.

1. Introduction

The language model (LM) is important for many natural language and speech processing applications. The goal of an LM is to assign probabilities to sequences of words according to a certain distribution. Speech recognition searches for the best word sequence $\hat{W}$ by maximizing the a posteriori (MAP) probability of the word sequence $W$ given the speech utterance $X$:

$$\hat{W} = \arg\max_{W} P(W|X) = \arg\max_{W} P(X|W)\,P(W) \qquad (1)$$

where $P(X|W)$ is the acoustic likelihood given the hidden Markov model (HMM), and $P(W)$ is the prior word probability given the LM.
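As a minimal sketch of the MAP decoding rule in Eq. (1), the decoder scores each hypothesis by summing its acoustic and LM log-probabilities and keeps the maximum. The function names and the toy log-likelihood values below are hypothetical, not from the paper; the optional `lm_scale` weight is common ASR practice rather than part of Eq. (1) itself.

```python
def map_score(acoustic_logprob, lm_logprob, lm_scale=1.0):
    """Log-domain MAP score for one hypothesis W given utterance X:
    log P(X|W) + lm_scale * log P(W)."""
    return acoustic_logprob + lm_scale * lm_logprob


def decode(hypotheses):
    """Return the word sequence maximizing the MAP score.
    `hypotheses` maps each candidate sequence to (log P(X|W), log P(W))."""
    return max(hypotheses, key=lambda w: map_score(*hypotheses[w]))


# Toy illustration with made-up log-likelihoods:
hyps = {
    "recognize speech": (-10.0, -2.0),
    "wreck a nice beach": (-9.5, -6.0),
}
best = decode(hyps)  # "recognize speech": -12.0 beats -15.5
```

In a real decoder the maximization runs over a search graph rather than an explicit hypothesis list, but the scoring combination is the same.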
The n-gram LM is a well-known approach that assigns a probability to the next word based on its immediately preceding n-1 history words. In an n-gram model [1], the probability of a word sequence $W = (w_1, \ldots, w_m)$ is calculated by multiplying the probabilities of each predicted word $w_i$ conditioned on its preceding $n-1$ words, denoted $w_{i-n+1}^{i-1}$:

$$P(W) = \prod_{i=1}^{m} P\big(w_i \,|\, w_1^{i-1}\big) \approx \prod_{i=1}^{m} P\big(w_i \,|\, w_{i-n+1}^{i-1}\big) \qquad (2)$$

where $P(w_i \,|\, w_1^{i-1})$ is the conditional probability of $w_i$ given $w_1^{i-1}$. To alleviate the data sparseness problem in n-gram models, the class-based LM [2] has been proposed. This method considers transition probabilities between classes rather than between words:

$$P\big(w_i \,|\, w_{i-n+1}^{i-1}\big) \approx P(w_i \,|\, c_i)\, P\big(c_i \,|\, c_{i-n+1}^{i-1}\big) \qquad (3)$$

where $c_i$ is the class assigned to word $w_i$, $P(w_i \,|\, c_i)$ is the probability of word $w_i$ being generated from class $c_i$, and $P(c_i \,|\, c_{i-n+1}^{i-1})$ is the class-based n-gram probability. The classes are derived by clustering words according to a criterion such as mutual information. N-gram models suffer from a lack of long-distance information, which limits model performance. To compensate for this, the n-gram model can be combined with adaptation methods, such as latent Dirichlet allocation (LDA), that extract semantic information. LDA [3] provides a powerful mechanism for discovering the structure of a text document; the latent topic of each document is treated as a random variable. To tackle data sparseness and extract large-span information for n-gram models, a Dirichlet class LM (DCLM) was constructed in [4]. In this technique, the latent variable reflects the class of an n-gram event rather than the topic as in the LDA model. In addition, the cache DCLM (CDCLM) [4] was proposed to improve DCLM by considering dynamic classes of history words in online estimation. These previous adaptation methods used only semantic information and did not consider syntactic features. In morphologically rich languages such as Farsi, exploiting syntactic information such as POS tags can be useful.
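The estimation behind Eqs. (2) and (3) can be illustrated with a minimal bigram (n = 2) sketch. The helper names, the `<s>` start marker, and the tiny corpus below are illustrative assumptions, not the paper's implementation; real systems would add smoothing to the maximum-likelihood counts.

```python
from collections import Counter


def train_bigram(corpus):
    """Count unigrams and bigrams (Eq. 2 with n = 2) from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent               # sentence-start marker (assumption)
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams


def bigram_prob(w_prev, w, unigrams, bigrams):
    """Maximum-likelihood estimate P(w | w_prev) = c(w_prev, w) / c(w_prev)."""
    return bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0


def class_bigram_prob(w_prev, w, word2class, p_word_given_class, p_class_trans):
    """Class-based approximation of Eq. (3):
    P(w | w_prev) ~= P(w | c(w)) * P(c(w) | c(w_prev))."""
    return (p_word_given_class[(w, word2class[w])]
            * p_class_trans[(word2class[w_prev], word2class[w])])


# Toy corpus: "cat" and "dog" share the history "the".
u, b = train_bigram([["the", "cat", "sat"], ["the", "dog", "sat"]])
p_word = bigram_prob("the", "cat", u, b)    # 1/2

# Eq. (3): pooling "cat"/"dog" into one noun class shares their statistics.
word2class = {"the": "DET", "cat": "N", "dog": "N"}
p_wc = {("cat", "N"): 0.5, ("dog", "N"): 0.5}
p_cc = {("DET", "N"): 1.0}
p_class = class_bigram_prob("the", "cat", word2class, p_wc, p_cc)
```

The class-based factorization is what lets an unseen bigram like "the mouse" still receive probability mass, provided "mouse" is assigned to a class whose transition from DET has been observed.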
Ali Hatami*, Ahmad Akbari**, and Babak Nasersharif***
* Computer Engineering Department, Iran University of Science and Technology, Tehran, Iran, ali_hatami@comp.iust.ac.ir
** Computer Engineering Department, Iran University of Science and Technology, Tehran, Iran, akbari@iust.ac.ir
*** Electrical and Computer Engineering Department, K. N. Toosi University of Technology, Tehran, Iran, bnasersharif@kntu.ac.ir