Tuning Phonotactic Language Identification System *

Pavel Matějka 1, Petr Schwarz 1, Jan Černocký 1, Pavel Chytil 2

1 Faculty of Information Technology, Brno University of Technology, Božetěchova 2, Brno, CZ 612 66, Phone: +420-5-41141283, Fax: +420-5-41141270
2 Department of Biomedical Engineering, OGI School of Science & Technology, OHSU, 20000 NW Walker Rd, Beaverton, OR 97006, USA, Phone: +1-503-748 4068
E-mail: matejkap@fit.vutbr.cz

This report provides a brief description of a Language Identification (LID) system based on a phoneme recognizer followed by language models (PRLM). Tuning the phoneme recognizer for this task can increase the performance of the whole system. Results are reported on data from the NIST 2003 LID evaluation. Our system achieves an Equal Error Rate (EER) of 5.4% on the task with 12 languages. This result compares favorably to the best known Parallel PRLM results from this evaluation.

1 Introduction

The goal of LID is to determine the language of a particular speech segment. This work concentrates on the phonotactic approach to language identification. The speech signal is first converted into a sequence of meaningful discrete sub-word units (tokens) that can characterize the language. In our case, these units are phonemes detected by a phoneme recognizer. The phoneme strings are then modeled by a statistical language model. We can consider phonemes as meaningful units, because words in different languages differ and have different pronunciations. We can use a phoneme recognizer to tokenize speech into phonemes even if this recognizer is not trained on the target language. In this case, the tokenization is close to a transcription of the unknown language using the phonemes of the language the tokenizer was trained on.

This article is a description of our baseline system. In section 2, a description of the whole LID system is given. In section 3, we describe the data, the evaluation method and the experiments.
A summary of the results, a comparison with published results and conclusions are given in section 4.

2 Description of the System

A good tokenizer is the most important part of an accurate LID system. We use a phoneme recognizer that is a hybrid system based on Neural Networks (NN) and a Viterbi decoder without any language model. An unconventional feature extraction technique based on long temporal patterns (TRAPs) [1] is used (see Figure 1).

* This work was partially supported by the EC project Augmented Multi-party Interaction (AMI), No. 506811, by the Grant Agency of the Czech Republic under project No. 102/05/0278, and by an industrial grant from CAMEA Ltd. Jan Černocký was supported by post-doctoral grant No. GA102/02/D108 of the Grant Agency of the Czech Republic.
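
The PRLM idea outlined in the Introduction (tokenize speech into phonemes, then score the phoneme string with one statistical language model per target language) can be illustrated with a minimal sketch. This is not the authors' implementation: the bigram order, the add-alpha smoothing, the toy phoneme inventory and the language names `lang_a` and `lang_b` are all assumptions made for illustration, and the phoneme strings are hand-written stand-ins for the output of a real phoneme recognizer.

```python
import math
from collections import Counter


def train_bigram_lm(sequences, vocab, alpha=1.0):
    """Return a log-probability function for an add-alpha smoothed bigram model.

    `sequences` is a list of phoneme lists; `vocab` is the (assumed) phoneme set.
    """
    bigrams = Counter()
    contexts = Counter()
    for seq in sequences:
        padded = ["<s>"] + list(seq)  # sentence-start symbol as first context
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    vocab_size = len(vocab)

    def logprob(prev, cur):
        num = bigrams[(prev, cur)] + alpha
        den = contexts[prev] + alpha * vocab_size
        return math.log(num / den)

    return logprob


def score(logprob, seq):
    """Average per-phoneme log-likelihood of a phoneme string under one model."""
    padded = ["<s>"] + list(seq)
    total = sum(logprob(p, c) for p, c in zip(padded, padded[1:]))
    return total / max(len(seq), 1)


def identify(models, seq):
    """Pick the language whose model gives the highest average log-likelihood."""
    return max(models, key=lambda lang: score(models[lang], seq))


# Toy data: hypothetical phoneme strings standing in for recognizer output.
vocab = ["a", "b", "k", "s"]
training = {
    "lang_a": ["a b a b a b".split(), "b a b a".split()],
    "lang_b": ["k s k s k".split(), "s k s k".split()],
}
models = {lang: train_bigram_lm(seqs, vocab) for lang, seqs in training.items()}

print(identify(models, "a b a b".split()))
print(identify(models, "k s k".split()))
```

Normalizing the score by the number of phonemes keeps segments of different duration comparable. A real system would train the language models on recognizer output for many hours of speech per language rather than on a few hand-written strings.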