J Braz Comput Soc (2011) 17: 53–68
DOI 10.1007/s13173-010-0023-1

ORIGINAL PAPER

Free tools and resources for Brazilian Portuguese speech recognition

Nelson Neto · Carlos Patrick · Aldebaro Klautau · Isabel Trancoso

Received: 5 July 2010 / Accepted: 19 October 2010 / Published online: 4 November 2010
© The Brazilian Computer Society 2010

Abstract An automatic speech recognition system has modules that depend on the language and, while there are many public resources for some languages (e.g., English and Japanese), the resources for Brazilian Portuguese (BP) are still limited. This work describes the development of resources and free tools for BP speech recognition, consisting of text and audio corpora, a phonetic dictionary, a grapheme-to-phone converter, and language and acoustic models. All of them are publicly available and, together with a proposed application programming interface, have been used for the development of several new applications, including a speech module for the OpenOffice suite. Performance tests are presented, comparing the developed BP system with commercial software. The paper also describes an application that uses speech synthesis and speech recognition together with a natural language processing module dedicated to statistical machine translation. This application allows the translation of spoken conversations from BP to English and vice versa. The resources make it easier for other academic groups and industry to adopt BP speech technologies.

Keywords Speech recognition · Brazilian Portuguese · Grapheme-to-phone conversion · Application programming interface · Speech-based applications

N. Neto · C. Patrick · A. Klautau
Federal University of Pará, Augusto Correa, 1, Belém, Brazil
e-mail: nelsonneto@ufpa.br
C. Patrick
e-mail: patrickalves@ufpa.br
A. Klautau
e-mail: aldebaro@ufpa.br

I. Trancoso
IST/INESC-ID, Alves Redol, 9, Lisbon, Portugal
e-mail: isabel.trancoso@inesc-id.pt

1 Introduction

Speech processing includes several technologies, among which automatic speech recognition (ASR) [1, 2] and text-to-speech (TTS) [3, 4] are the most prominent. TTS systems are software modules that convert natural language text into synthesized speech [5]. ASR can be seen as the inverse of TTS: the digitized speech signal is converted into text. In spite of problems such as limited robustness to noise, ASR also has its market, which, according to Opus Research, topped one billion dollars for the first time in 2006 and is expected to reach US$ 3 billion in 2010, with niches such as medical reporting and electronic health care records. Dominated in the past by companies specialized in ASR, the market currently has players such as Microsoft and Google, which are investing heavily in supporting ASR (and TTS) on Windows [6] and Chrome [7], for example. This work presents the results of an ambitious project that aims to help academia and the software industry develop speech science and technology focused on BP.

ASR is a data-driven technology that requires a relatively large amount of labeled data. Researchers rely on public corpora and other speech-related resources to advance the state of the art. Some research groups have proprietary speech and text corpora [8–10]. For European Portuguese (EP), the main resource collection efforts have targeted Broadcast News (BN), aiming at automatic captioning applications for the deaf community. The manually labeled BN corpus contains around 60 hours of audio, but even with this limited size, it has already allowed the deployment of a fully automatic subtitling system [11], online at the public TV channel since March 2008.
Other speech corpora have been collected for other domains: BDPublico [12] (EP database equivalent to the Wall Street Journal corpus [13]), CORAL [14] (map-task dialog corpus),