CIEMPIESS: A New Open-Sourced Mexican Spanish Radio Corpus

Carlos D. Hernández-Mena, Abel Herrera-Camacho
Departamento de Procesamiento Digital de Señales
Universidad Nacional Autónoma de México (UNAM)
ca hernandez@uxmcc2.iimas.unam.mx, abelhc@hotmail.com

Abstract

This paper presents the development of the "Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social" (CIEMPIESS), a new open-source corpus extracted from Spanish-language FM podcasts in the dialect of central Mexico. The CIEMPIESS corpus was designed for use in automatic speech recognition (ASR), and it is provided with two kinds of pronouncing dictionaries: one containing the phonemes of Mexican Spanish and the other containing these same phonemes plus allophones. Corpus annotation took into account the tonic vowel of every word and the four different sounds that the letter "x" represents in Spanish. The CIEMPIESS corpus is also provided with two language models extracted from electronic newsletters; one takes the tonic vowels into account and the other does not. Both the dictionaries and the language models allow users to experiment with different recognition scenarios in order to adapt the corpus to their needs.

Keywords: Spanish radio corpus, Mexican Spanish corpus, Mexican phonemes, Mexican allophones

1. Motivation

Mexican Spanish remains a resource-scarce language, but this lack of resources is not exclusive to this particular dialect. In general, developing ASR tools for languages other than English is not always easy, and the difficulty depends on the language you want to recognize. For example, you can use the CMU-SLMTK¹ to create a language model for your application but not to create the pronouncing dictionary, because that dictionary is generated only with English phonemes and there is no similarly widespread tool for every language you may want.
Hence, when you want to recognize other languages, you also have to select the appropriate set of phonemes. In the case of Spanish, you can choose between different computational phonetic alphabets such as SAMPA (Wells, 1997) or Worldbet (Hieronymus, 1994), but you have to be careful because these alphabets were created for several languages and dialects, and you may run into trouble if you lack basic knowledge of phonetics. Ideally, many engineers and computer scientists would prefer to have only the set of phonemes and allophones they need, without having to worry about phonetic issues. Providing such solutions is usually the responsibility of researchers and specialists in their own countries.

In the literature you can find a few corpora for the Spanish language (see (Llisterri, 2004)), but you have to adapt them to the dialect of Mexico if you want the best results (an example of this kind of adaptation can be found in (Varela et al., 2003)).

These adaptation issues and the scarcity of resources for the variant of Spanish spoken in central Mexico are our main motivation for developing the CIEMPIESS corpus, an open-source tool designed for the creation of acoustic models for ASR systems.

We argue that creating the CIEMPIESS corpus as an open-source tool is not altruism but a real need for the development of speech technologies suited to the particular needs of our country. Other examples of this kind of "altruistic" idea can be found in the creation of the operating system "Linux". "Linux" was created out of the need for an open-source operating system for research, and it is now supported by thousands of programmers all over the world and is totally free! (Two interesting articles on what motivates people to develop free software are (Hars and Ou, 2001; Hertel et al., 2003).)

¹ Statistical Language Modeling Toolkit by Carnegie Mellon University. See http://www.speech.cs.cmu.edu/SLM/toolkit.html

2. Corpus

CIEMPIESS is a radio corpus of the Mexican Spanish spoken in central Mexico, specifically in Mexico City. This is an important detail because "Mexico’s City population is representative of the whole country", as stated in (Pineda et al., 2009). The total duration of CIEMPIESS is 17 hours.

CIEMPIESS has been annotated at the word level, and tonic vowels are marked in the transcription files. Language models and pronouncing dictionaries are provided in order to increase its flexibility for the recognition task, as we will see in the following sections.

2.1. Utterances

The CIEMPIESS corpus was taken from 43 one-hour FM radio programs², recorded in MP3 stereo format with a 44.1 kHz sample rate and a bit rate of 128 kbps or higher. From these recordings, only the utterances considered "clean" were selected; that is, utterances spoken by a single person with no background noise, whispers, music, foreign accents, white noise or static. Our notion of "clean speech" is based partially on (Ostendorf et al., 1995). The selected utterances were then converted into 16,717 16-bit audio files with a sampling rate of 16 kHz in the NIST Sphere PCM mono format.

² Downloaded from http://podcast.unam.mx/
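
As a rough cross-check of the figures quoted in Section 2.1, the following Python sketch derives three quantities from them: the fraction of the source audio that was kept, the mean utterance length, and the approximate size of the raw audio. It is only an illustration under stated assumptions. It assumes each of the 43 programs lasts exactly one hour, treats the 17-hour total as exact although it is probably rounded, and ignores NIST Sphere header overhead. The constant and function names are hypothetical and are not part of the corpus distribution.

```python
# Figures quoted in the text. The 17-hour total is likely a rounded value,
# so every derived number below is an approximation.
SOURCE_HOURS = 43        # 43 one-hour FM radio programs (assumed exactly 1 h each)
CORPUS_HOURS = 17        # total duration of the selected "clean" utterances
NUM_FILES = 16_717       # number of utterance files after conversion
SAMPLE_RATE_HZ = 16_000  # target sampling rate
BYTES_PER_SAMPLE = 2     # 16-bit PCM, one channel (mono)


def corpus_summary():
    """Derive rough summary statistics from the published corpus figures."""
    corpus_seconds = CORPUS_HOURS * 3600
    return {
        # Share of the recorded audio that survived the "clean speech" filter.
        "retained_fraction": CORPUS_HOURS / SOURCE_HOURS,
        # Average length of one utterance file, in seconds.
        "mean_utterance_seconds": corpus_seconds / NUM_FILES,
        # Raw PCM payload size in bytes (file headers are not counted).
        "raw_pcm_bytes": corpus_seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE,
    }


if __name__ == "__main__":
    for name, value in corpus_summary().items():
        print(f"{name}: {value}")
```

Under these assumptions, roughly 40% of the recorded audio was retained, the average utterance lasts about 3.7 seconds, and the raw audio occupies a little under 2 GB.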