A Hybrid Approach to Statistical Language Modeling with Multilayer Perceptrons and Unigrams

Fernando Blat, María José Castro, Salvador Tortajada, and Joan Andreu Sánchez

Departament de Sistemes Informàtics i Computació
Universitat Politècnica de València. E-46022 València, Spain
{fblat,mcastro,stortajada,jandreu}@dsic.upv.es

Abstract. In language engineering, language models are employed to improve system performance. These language models are usually N-gram models, estimated from large text databases using the occurrence frequencies of these N-grams. An alternative to conventional frequency-based estimation of N-gram probabilities consists in using neural networks to this end. In this paper, an approach to language modeling with a hybrid language model is presented as a linear combination of a connectionist N-gram model, which represents the global relations between certain linguistic categories, and a stochastic model of the distribution of words into such categories. The hybrid language model is tested on the corpus of the Wall Street Journal processed in the Penn Treebank project.

1 Introduction

Language modeling is the attempt to characterize, capture, and exploit regularities in natural language. In problems such as automatic speech recognition, machine translation, text classification, or other pattern recognition tasks, it is useful to adequately restrict the possible or probable sequences of units which define the set of sentences (language) allowed in the application task. In general, the incorporation of a language model reduces the complexity of the system (it guides the search for the optimal response) and increases its success rate. Under a statistical framework, a language model is used to assign to every possible word sequence W an estimate of the a priori probability of its being the correct system response, Pr(W) = Pr(w_1 ... w_{|W|}).
Statistical language models are usually based on the prediction of each linguistic unit in the sequence given the preceding ones [1, 2]:

\Pr(w_1 \ldots w_{|W|}) = \Pr(w_1) \cdots \Pr(w_{|W|} \mid w_1 \ldots w_{|W|-1}) = \prod_{i=1}^{|W|} \Pr(w_i \mid h_i) , \quad (1)

where h_i = w_1 ... w_{i-1} denotes the history from which unit w_i has to be predicted. The number of parameters to estimate becomes intractable as the length of the sentence increases. N-gram models [1] are the most widespread method to reduce this number of parameters.

This work has been supported by the Spanish CICYT under contract TIC2003-07158-C04-03.
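As a minimal sketch of the decomposition in Eq. (1), the following Python fragment estimates an N-gram model by relative frequency and multiplies the conditional probabilities Pr(w_i | h_i) along a sentence. The toy corpus, the bigram truncation of the history h_i, the `<s>` start marker, and the unsmoothed maximum-likelihood estimates are illustrative assumptions for exposition only, not the connectionist model proposed in this paper.

```python
from collections import Counter

def train_bigram(sentences):
    """Estimate bigram probabilities Pr(w_i | w_{i-1}) by relative frequency."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent  # sentence-start marker so the first word has a history
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}

def sentence_prob(model, sent):
    """Pr(w_1 ... w_|W|) = prod_i Pr(w_i | h_i), with h_i truncated to one word."""
    prob, prev = 1.0, "<s>"
    for w in sent:
        prob *= model.get((prev, w), 0.0)  # unseen bigrams get probability 0 (no smoothing)
        prev = w
    return prob

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
model = train_bigram(corpus)
print(sentence_prob(model, ["the", "cat", "sat"]))  # 1.0 * 0.5 * 1.0 = 0.5
```

Assigning probability zero to unseen N-grams is exactly the sparsity problem that motivates smoothing and, in this paper, connectionist estimation of the N-gram probabilities.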