J Braz Comput Soc (2011) 17: 53–68
DOI 10.1007/s13173-010-0023-1
ORIGINAL PAPER
Free tools and resources for Brazilian Portuguese speech
recognition
Nelson Neto · Carlos Patrick · Aldebaro Klautau ·
Isabel Trancoso
Received: 5 July 2010 / Accepted: 19 October 2010 / Published online: 4 November 2010
© The Brazilian Computer Society 2010
Abstract An automatic speech recognition system has
modules that depend on the language and, while there are
many public resources for some languages (e.g., English
and Japanese), the resources for Brazilian Portuguese (BP)
are still limited. This work describes the development of
resources and free tools for BP speech recognition, consisting
of text and audio corpora, a phonetic dictionary, a grapheme-
to-phone converter, and language and acoustic models. All of
them are publicly available and, together with a proposed
application programming interface, have been used for the
development of several new applications, including a speech
module for the OpenOffice suite. Performance tests are
presented, comparing the developed BP system with
commercial software. The paper also describes an application
that uses synthesis and speech recognition together with a
natural language processing module dedicated to statistical
machine translation. This application allows the translation
of spoken conversations from BP to English and vice versa.
The resources make it easier for other academic groups and
industry to adopt BP speech technologies.
Keywords Speech recognition · Brazilian Portuguese ·
Grapheme-to-phone conversion · Application programming
interface · Speech-based applications
N. Neto · C. Patrick · A. Klautau
Federal University of Pará, Augusto Correa, 1, Belém, Brazil
e-mail: nelsonneto@ufpa.br
C. Patrick
e-mail: patrickalves@ufpa.br
A. Klautau
e-mail: aldebaro@ufpa.br
I. Trancoso
IST/INESC-ID, Alves Redol, 9, Lisbon, Portugal
e-mail: isabel.trancoso@inesc-id.pt
1 Introduction
Speech processing includes several technologies, among
which automatic speech recognition (ASR) [1, 2] and text-
to-speech (TTS) [3, 4] are the most prominent. TTS systems
are software modules that convert natural language text into
synthesized speech [5]. ASR can be seen as the inverse
process of TTS, in which the digitized speech signal is
converted into text. In spite of problems such as limited
robustness to noise, ASR also has its market, which, according
to Opus Research, topped one billion dollars for the first
time in 2006 and is expected to reach US$ 3 billion in 2010,
with niches such as medical reporting and electronic health
care records. Dominated in the past by companies specialized
in ASR, the market currently has players such as Microsoft
and Google, heavily investing in supporting ASR (and TTS)
on Windows [6] and Chrome [7], for example. This work
presents the results of an ambitious project, which aims at
helping academia and the software industry develop
speech science and technology focused on BP.
ASR is a data-driven technology that requires a relatively
large amount of labeled data. Researchers rely
on public corpora and other speech-related resources to ex-
pand the state of the art. Some research groups have pro-
prietary speech and text corpora [8–10]. For European Por-
tuguese (EP), the main resource collection efforts have tar-
geted Broadcast News (BN), aiming at automatic caption-
ing applications for the deaf community. The manually la-
beled BN corpus contains around 60 hours of audio, but
even with this limited size, it has already allowed the
deployment of a fully automatic subtitling system [11], which has
been online at the public TV channel since March 2008. Other speech
corpora have been collected for other domains: BDPub-
lico [12] (EP database equivalent to the Wall Street Jour-
nal corpus [13]), CORAL [14] (map-task dialog corpus),