ADVANCES IN LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION IN GREEK: MODELING AND NONLINEAR FEATURES

Isidoros Rodomagoulakis 1,3, Gerasimos Potamianos 2,3, and Petros Maragos 1,3

1 School of ECE, National Technical University of Athens, 15773 Athens, Greece
2 Department of CCE, University of Thessaly, 38221 Volos, Greece
3 Athena Research and Innovation Center, 15125 Maroussi, Greece

irodoma@cs.ntua.gr, gpotam@ieee.org, maragos@cs.ntua.gr

ABSTRACT

The main goal of this work is the development of an improved Large Vocabulary Continuous Speech Recognition (LVCSR) framework for Greek. Language models are trained on a collection of journalistic text, and in the acoustic front-end a nonlinear approach is used to derive features of the AM-FM type. Experiments are carried out on both clean and simulated far-field speech, offering insight into acoustic modeling under adverse conditions with reverberation and additive ambient noise. Beyond the baseline implementation, a first step is made in exploring how standard features (MFCCs and PLPs) and modulation features (AM-FM) behave in an LVCSR framework when the input speech is distant, as in real-life home applications.

Index Terms: Speech Processing, Acoustic modeling, Language modeling, Large Vocabulary Continuous Speech Recognition

1. INTRODUCTION

One of the main difficulties in Greek Automatic Speech Recognition (ASR) is the complex nature of the language, which stems from its many inflectional rules. Many efforts have therefore been made in language processing and modeling, but only a few works report extensive results on Large Vocabulary Continuous Speech Recognition (LVCSR) problems, which involve many other issues regarding feature extraction, acoustic modeling, and recognition methods. Among these works is the implementation of a dictation system [3], which achieved 19.27% Word Error Rate (WER) by using speaker-independent genonic Hidden Markov Models (HMMs) [2].
A WER reduction of 0.28% was obtained on the same database by using maximum entropy language models [9] that employ stem information to cope with the very large number of distinct words. Another LVCSR module has been implemented in [6] for a Greek broadcast transcription system, where the reported WER for speaker-independent recognition in mixed recording conditions was 38.42%. On Greek phoneme recognition, experiments have been conducted on the SpeechDat(II)-FDB-5000 Greek database [1], yielding 39.06% classification accuracy. Finally, some efforts have also been made for ASR in home environments. An implemented system [8] with a low-cost microphone recognized 3.2k Greek words and instructions in spontaneous speech, using a task-dependent grammar and yielding 5.25% WER in recordings with 12 dB SNR. The acoustic modeling was based on the SpeechDat(II)-FDB-5000 corpus. Overall, a possible limitation of these works is the use of standard front-end methods that extract traditional cepstral features and have been applied only to small- and medium-vocabulary tasks.

This research was supported by the European Union under the research program DIRHA with grant FP7-ICT-2011-7-288121.

Our motivation for this work lies mainly in the fact that Greek ASR lacks extensive experimentation on large vocabulary databases. Moreover, the front-end is mostly based on the linear speech model with traditional cepstral features, without any attempt to extend them or combine them with nonlinear features that relax some assumptions of the linear speech production model. AM-FM features and their variants, in particular, have proven effective for other recognition problems in other languages (Spanish, English). Although some efforts have been made to improve acoustic and language modeling for Greek, only a few works have fully integrated all the components into an ASR framework for LVCSR experimentation.
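For reference, the WER figures quoted above are conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal Python sketch under that assumption; the function name `word_error_rate` and the whitespace tokenization are illustrative choices and are not taken from any of the cited systems.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); tokens are split on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.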
The next sections describe all the components of the implemented recognizer and the results obtained on a large vocabulary Greek speech corpus. Language modeling with n-grams is described in Sec. 2, while Sec. 3 analyzes the extraction of standard Mel-Frequency Cepstral Coefficients (MFCCs), Perceptual Linear Prediction (PLP) coefficients, and nonlinear AM-FM based features for speech. Section 4 covers acoustic modeling with context-dependent triphones, and Sec. 5 describes the conducted experiments and the results obtained for clean speech and far-field simulations. Finally, Sec. 6 concludes the paper.

2. LANGUAGE MODELING FOR GREEK ASR

This section describes the development of back-off n-gram language models for Greek in the journalism domain.

EUSIPCO 2013

The