Statistical Phonetic Analysis of the Romanian Language for Speech Recognition and Synthesis Tasks

Miruna Stănescu (Paşca), Andi Buzo, Horia Cucu, Corneliu Burileanu
University “Politehnica” of Bucharest, Splaiul Independentei nr. 313, Bucharest, Romania
mirunapasca@yahoo.com

Abstract - This article provides a statistical phonetic analysis based on the largest Romanian text corpus collected so far for research purposes. Several types of phonetic events are analyzed: phones, diphones, triphones, and phone clusters based on the general classification of phones in the Romanian language. Some interesting conclusions are drawn, such as the fact that fewer than half of the diphones cover 99% of the whole text. The article also discusses some uses of these phonetic statistics for spoken language technology tasks.

Keywords - Spoken language technology; Text-to-speech; Automatic speech recognition; Phonetic event

I. INTRODUCTION

Research in the field of spoken language technology includes two main subfields: building text-to-speech (TTS) systems and building automatic speech recognition (ASR) systems. For both tasks, the main resource is the corpus, whether a speech corpus or a text corpus. From both points of view, Romanian is still an under-resourced language. TTS and ASR systems are being built or optimized by several research teams [1], [2], [3], [4], but text corpora and speech corpora are still relatively small and not always freely available (compared to languages such as English, French, or German). As such, corpora acquisition is an ongoing challenge for most spoken language technology tasks, and statistics that would help this process are almost non-existent.

For TTS or ASR systems, one of the major inputs is the speech database; when creating such a database for an under-resourced language like Romanian, the first step is selecting the phrases to be recorded.
This paper tries to assess the need for taking phonetic statistics into account during this selection phase. Doing so would focus resources and effort on the most frequent phonetic events, for example by training better models for the triphones encountered most often during speech recognition. We computed the needed statistics on the largest text corpus collected for the Romanian language, and the conclusions drawn from these results could influence the future development of Romanian speech processing tasks.

The rest of this paper is organized in four sections. Section 2 examines the phonetic particularities of the Romanian language, as well as other research relevant to the task at hand. Section 3 details the corpus acquisition and the processing steps that bring this corpus to the phonetically transcribed form needed for the phonetic statistics. Section 4 discusses the experimental results and, finally, Section 5 draws some conclusions.

II. RELATED WORK

To our knowledge, there is only one relevant paper [5] reporting statistics for different phonetic events (phones, diphones, and quinphones) in Romanian. However, these statistics are computed on a very small corpus of approximately 2500 newspaper sentences, called the RSS-text (Romanian Speech Synthesis-text) corpus. These sentences were selected from a larger corpus (about 1.7 million words) collected from online newspaper articles. The selection was made for the best coverage of the Romanian diphones with at least 10 occurrences in the words of the DEX (Romanian Dictionary) online database. A similar text selection technique is used by another TTS system [4], and ASR systems [1] currently take phone coverage into account when selecting phrases for the training database. However, since state-of-the-art ASR systems are based on triphones (context-based training for phones), it is clear that triphone-level statistics should also be taken into account when developing speech or text corpora.
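To make the link between phonetic statistics and phrase selection concrete, the following is a minimal Python sketch, not the procedure used in [4] or [5]. It counts diphones or triphones in phone-transcribed sentences, reports how many of the most frequent unit types are needed to reach a given share of the text, and picks sentences with a generic greedy coverage heuristic. All function names and the toy phone symbols are hypothetical and chosen only for illustration.

```python
from collections import Counter


def nphones(phones, n):
    """Return every run of n consecutive phones (n=2: diphones, n=3: triphones)."""
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]


def count_units(sentences, n):
    """Count n-phone occurrences over a list of phone-transcribed sentences."""
    counts = Counter()
    for phones in sentences:
        counts.update(nphones(phones, n))
    return counts


def types_needed_for_coverage(counts, target=0.99):
    """Number of most frequent unit types whose occurrences reach `target` of all tokens."""
    total = sum(counts.values())
    running = 0
    for k, (_, c) in enumerate(counts.most_common(), start=1):
        running += c
        if running >= target * total:
            return k
    return len(counts)


def greedy_select(sentences, n, max_sentences):
    """Greedy heuristic: repeatedly take the sentence adding the most uncovered unit types."""
    unit_sets = [set(nphones(s, n)) for s in sentences]
    covered, chosen = set(), []
    remaining = list(range(len(sentences)))
    while remaining and len(chosen) < max_sentences:
        best = max(remaining, key=lambda i: len(unit_sets[i] - covered))
        gain = unit_sets[best] - covered
        if not gain:  # nothing new can be covered
            break
        covered |= gain
        chosen.append(best)
        remaining.remove(best)
    return chosen, covered


# Toy corpus with made-up phone symbols (assumption, not real Romanian data).
toy_corpus = [
    ["a", "b", "a", "b"],
    ["a", "b", "c"],
    ["c", "a"],
]

diphone_counts = count_units(toy_corpus, 2)
print(diphone_counts.most_common())
print(types_needed_for_coverage(diphone_counts, target=0.99))
print(greedy_select(toy_corpus, 2, max_sentences=2))
```

Calling the same functions with n=3 gives the triphone variant. A real selection procedure would also have to handle word boundaries, minimum-occurrence thresholds, and sentence length, none of which this sketch models.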
Phonetic statistics have recently been published for other under-resourced languages, such as Polish [6] and Turkish [7]. For Polish [6], the authors take into account the gap between words (as a phone pronounced at the beginning of a word is not the same as one pronounced in the middle or at the end of the word), while the authors of the Turkish study [7] do not. One particularity of the Turkish language [7] is that interesting conclusions can be drawn from organizing the consonants into groups such as soft, hard, sustainable, and unsustainable. For the Romanian language, this classification is different, as detailed below.

The Romanian language has some phonetic particularities [4, 8] that have to be taken into account when creating text or speech corpora. Compared to English [9], some phones are missing (e.g. the th in thin), there are two vowels typical of Romanian (ə, ɨ), and there is a whole new group called semi-vowels (e̯, j, o̯, w). The semi-vowels are similar to the English glides (vowels whose position in the syllable requires them to be a little shorter and which cannot be stressed), but for Romanian they are classified separately from both vowels and consonants. The sonant non-