Learning and Nonlinear Models (L&NLM) – Journal of the Brazilian Neural Network Society, Vol. 8, Iss. 3, pp. 148-156, 2010.
© Sociedade Brasileira de Redes Neurais (SBRN)
A SPOKEN WORD BOUNDARIES DETECTION STRATEGY FOR VOICE
COMMAND RECOGNITION
IGOR S. PERETTA, GERSON F. M. LIMA, JOSIMEIRE A. TAVARES, KEIJI YAMANAKA
Computer Engineering Department, Electrical Engineering Faculty,
Federal University of Uberlandia – P.O. Box 593, 38400-902, Uberlandia, MG, BRAZIL
E-mails: iperetta@gmail.com, gersonlima@ieee.org, josycbelo@gmail.com, keiji@ufu.br
Abstract — The use of voice commands as a new form of interaction between humans and machines has been the subject of much research in
recent years and has already produced commercial and freeware applications. However, considering the results achieved so far, there is
still great potential for development in this area, particularly for the Brazilian Portuguese language. This work proposes: 1. an efficient
method for detecting spoken word boundaries in a recorded signal, using the Teager Energy Operator and an FIR filter; 2. the use of
the wavelet transform and a wavelet packet filter bank as the main tools for feature extraction to feed a multi-layer artificial neural network to
recognize a limited vocabulary of voice commands. The system was developed using a dataset of spoken words from 50 speakers,
using normal pronunciation speed and in an environment without any noise control. Tests with the system show a very good
classification rate and noise robustness.
Keywords — Voice command recognition, spoken word boundaries detection, Teager energy operator, discrete wavelet transform,
wavelet packet filter bank, artificial neural network.
1. Introduction
Speech recognition and voice command recognition form an extensive area of research with many possible applications in our daily
lives: they could simplify many everyday tasks, enable new forms of human-computer interaction, provide innovative controls
for the development of expanded reality, or even support the inclusion of people with severe movement restrictions.
However, after many years of worldwide research, there is still no definitive application.
Several factors prevent the complete success of this objective: the unpredictable quality of the equipment
used to capture the voice; the different levels of noise to which applications are always subjected; the inherent differences
between individual speakers; the changes in the speech of a single speaker across different situations, caused by
illness, fatigue, or even the so-called Lombard effect; and the incomplete understanding of the biological and cognitive
processes of human hearing. Brazilian Portuguese poses a particular challenge for speech recognition, given the wide variety of accents
throughout the Brazilian territory.
Several conventional preprocessing techniques are well known in speech recognition, such as the extraction of LPC coefficients [1],
[2] or Mel-frequency cepstral coefficients [3]. Wavelets are also used in the speech recognition field [4], [5]. Comparisons between
some of these techniques can be found in the literature [6], [7].
This research rests on two main premises: first, artificial neural networks (ANN) have achieved several successes in speech
pattern recognition; second, the sound analysis performed by the human ear can be represented by wavelet transforms, at least in its first stage,
which is determined by the response function of the human cochlea [8]. Wavelet functions that emulate the frequency resolution of the
human cochlea have also been shown to increase robustness to noise [4].
The proposed system was implemented to recognize a limited vocabulary in Brazilian Portuguese comprising six voice
commands: "SOBE" (up), "DESCE" (down), "AZUL" (blue), "VERMELHO" (red), "DIREITA" (right), and "ESQUERDA" (left).
The diagram of the system is shown in Fig. 1.
The database, recorded with the freeware Audacity® [9], contains three recordings of each of the six voice
commands from each of 50 speakers, 30 male and 20 female, aged between 17 and 40 years. The database thus has a total of
900 samples (50 speakers × 6 commands × 3 recordings) in the Waveform audio format (WAV). Audacity® was configured with a sampling
frequency of 8 kHz and a resolution of 16 bits per sample of signal amplitude. The recordings were made using a simple computer
microphone, with an impedance of 75 Ω, in a room with a steady stream of people and no noise control.
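As an illustrative aside, the database figures given above can be checked with a short Python sketch. It is not part of the original system: the function names are hypothetical, the mono channel layout is an assumption (the text does not state the number of channels), and only the Python standard-library wave module is used.

```python
import io
import wave

# Figures taken from the database description in the text.
SPEAKERS = 50
COMMANDS = ["SOBE", "DESCE", "AZUL", "VERMELHO", "DIREITA", "ESQUERDA"]
VERSIONS = 3                # recordings of each command per speaker
SAMPLE_RATE_HZ = 8000       # 8 kHz sampling frequency
SAMPLE_WIDTH_BYTES = 2      # 16 bits per sample


def expected_sample_count():
    """Total number of recordings: speakers x commands x versions."""
    return SPEAKERS * len(COMMANDS) * VERSIONS


def matches_recording_format(wav_file):
    """Return True if a WAV file (path or file-like object) is 8 kHz, 16-bit."""
    with wave.open(wav_file, "rb") as w:
        return (w.getframerate() == SAMPLE_RATE_HZ
                and w.getsampwidth() == SAMPLE_WIDTH_BYTES)


def make_silent_wav(rate=SAMPLE_RATE_HZ, width=SAMPLE_WIDTH_BYTES, seconds=1):
    """Build an in-memory silent WAV for demonstration (mono is an assumption)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setframerate(rate)
        w.setsampwidth(width)
        w.writeframes(b"\x00" * (rate * width * seconds))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(expected_sample_count())
    print(matches_recording_format(make_silent_wav()))
```

A check of this kind could be run over every file before feature extraction, so that a recording exported with a different sampling rate or bit depth is caught early rather than silently skewing the results.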