Acquisition of Telephone Data from Radio Broadcasts with Applications to Language Recognition ⋆

Oldřich Plchot, Valiantsina Hubeika, Lukáš Burget, Petr Schwarz and Pavel Matějka

Speech@FIT, Brno University of Technology, Czech Republic,
iplchot|burget|schwarzp|matejkap@fit.vutbr.cz, xhubei00@stud.fit.vutbr.cz

Abstract. This paper presents a procedure for acquiring linguistic data from broadcast media and its use in language recognition. The goal of this work is to answer the question of whether data obtained automatically from broadcasts can replace or augment continuous telephone speech. The main challenges are channel compensation and the large proportion of non-spontaneous speech in broadcasts. The experimental results are obtained on the NIST LRE 2007 evaluation system, using both NIST-provided training data and data obtained from broadcasts.

Key words: Language Identification (LID), Broadcast data, Phone call detection, Channel compensation

1 Introduction

We introduce a process for the automatic acquisition of speech data from various media sources for the language identification task. Recent editions of the NIST Language Recognition Evaluation (LRE) have shown that both acoustic and phonotactic approaches have reached a certain level of maturity, both in modeling target languages and in dealing with the influence of different channels. However, we still face a common problem: the lack of training data. For many languages there is no good or sufficiently large database of training data; this includes even languages such as Thai, which is spoken by 65 million people. There is also an increasing demand to recognize languages from smaller and less populous regions (many of them relevant to the security or defense domain). For some of these languages, no standard speech resources exist.
⋆ This work was partly supported by European projects AMIDA (FP6-033812), Caretaker (FP6-027231) and MOBIO (FP7-214324), by the Grant Agency of the Czech Republic under project No. 102/08/0707 and by the Czech Ministry of Education under project No. MSM0021630528. The hardware used in this work was partially provided by CESNET under project No. 162/2005. Lukáš Burget was supported by the Grant Agency of the Czech Republic under project No. GP102/06/383.

This work aims to solve this problem using data acquired from public sources, such as satellite and Internet TV and radio stations, which contain conversational speech or telephone calls. This approach should provide us with large