Learning and Nonlinear Models (L&NLM) – Journal of the Brazilian Neural Network Society, Vol. 8, Iss. 3, pp. 148-156, 2010. © Sociedade Brasileira de Redes Neurais (SBRN)

A SPOKEN WORD BOUNDARIES DETECTION STRATEGY FOR VOICE COMMAND RECOGNITION

IGOR S. PERETTA, GERSON F. M. LIMA, JOSIMEIRE A. TAVARES, KEIJI YAMANAKA

Computer Engineering Department, Electrical Engineering Faculty, Federal University of Uberlandia – P.O. Box 593, 38400-902, Uberlandia, MG, BRAZIL

E-mails: iperetta@gmail.com, gersonlima@ieee.org, josycbelo@gmail.com, keiji@ufu.br

Abstract — The use of voice commands as a new way of interaction between man and machine has been the subject of considerable research in recent years and has already produced commercial and freeware applications. However, considering the results achieved so far, there is still great potential for development in this area, particularly for the Brazilian Portuguese language. This work proposes: (1) an efficient method for detecting spoken word boundaries in a recorded signal, using the Teager Energy Operator and an FIR filter; (2) the use of the wavelet transform and a wavelet packet filter bank as the main tool for feature extraction, feeding a multi-layer artificial neural network that recognizes a limited vocabulary of voice commands. The system was developed using a dataset of spoken words from 50 speakers, recorded at normal pronunciation speed in an environment without any noise control. Tests with the system show a very good classification rate and noise robustness.

Keywords — Voice command recognition, spoken word boundaries detection, Teager energy operator, discrete wavelet transform, wavelet packet filter bank, artificial neural network.

1. Introduction

Speech recognition and voice command recognition form an extensive area of research with many possible applications in daily life: they could simplify many everyday tasks, enable new forms of human-computer interaction, generate innovative controls for the development of expanded reality, or even support the inclusion of people with severe movement restrictions. However, after many years of worldwide research, there is still no definitive application. Several factors stand in the way of this objective: the unpredictable quality of the equipment used to capture the voice; the different levels of noise to which applications are always subjected; the inherent differences between speakers; changes in the speech of a single speaker across situations, caused by illness, fatigue, or the so-called Lombard effect; and the incomplete understanding of the biological and cognitive processes of human hearing. Brazilian Portuguese poses a particular challenge for speech recognition, given the wide variety of accents throughout the Brazilian territory.

Several conventional preprocessing techniques are well known in speech recognition, such as the extraction of LPC coefficients [1], [2] or Mel-frequency cepstrum coefficients [3]. Wavelets are also used in the speech recognition field [4], [5]. Comparisons between some of these techniques can be found in the literature [6], [7]. This research rests on two main premises: first, artificial neural networks (ANN) have achieved several successes in speech pattern recognition; second, the sound analysis performed by the human ear can be represented by wavelet transforms, at least in its first stage, which is determined by the response function of the human cochlea [8]. Wavelet functions have also been shown to increase robustness to noise by emulating the frequency resolution of the human cochlea [4].
The proposed system was implemented to recognize a limited Brazilian Portuguese vocabulary of six voice commands: “SOBE” (up), “DESCE” (down), “AZUL” (blue), “VERMELHO” (red), “DIREITA” (right), and “ESQUERDA” (left). The system diagram is shown in Fig. 1. The database, recorded with the freeware Audacity® [9], contains three recordings of each of the six voice commands from each of 50 speakers, 30 male and 20 female, aged between 17 and 40 years. The database thus holds a total of 900 samples in the Waveform audio format (WAV). Audacity® was configured with a sampling frequency of 8 kHz and a resolution of 16 bits per sample of signal amplitude. The recordings were made with a simple computer microphone with an impedance of 75 Ω, in a room with a steady stream of people and no noise control.
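
The database figures above can be checked with a short sketch. It is an illustration only, written in Python with the standard-library wave module, and is not part of the original system. The function names and the in-memory test recording are hypothetical, and the single-channel (mono) setting is an assumption, since the text does not state the number of channels.

```python
import io
import wave

# Figures taken from the database description in the text.
N_SPEAKERS = 50
N_VERSIONS = 3
COMMANDS = ["SOBE", "DESCE", "AZUL", "VERMELHO", "DIREITA", "ESQUERDA"]
SAMPLE_RATE_HZ = 8000      # 8 kHz sampling frequency
SAMPLE_WIDTH_BYTES = 2     # 16 bits per sample


def expected_dataset_size() -> int:
    """Number of recordings: speakers x commands x versions per command."""
    return N_SPEAKERS * len(COMMANDS) * N_VERSIONS


def matches_recording_format(wav_file) -> bool:
    """Return True if a WAV file (path or file object) uses the stated
    sampling frequency and sample width."""
    with wave.open(wav_file, "rb") as wav:
        return (wav.getframerate() == SAMPLE_RATE_HZ
                and wav.getsampwidth() == SAMPLE_WIDTH_BYTES)


def make_test_wav(rate_hz: int, width_bytes: int, n_frames: int = 800) -> io.BytesIO:
    """Hypothetical helper: build an in-memory WAV of zero-valued frames.

    Mono is an assumption; the text does not give the channel count.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(width_bytes)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00" * (width_bytes * n_frames))
    buffer.seek(0)  # rewind so the buffer can be read back
    return buffer


if __name__ == "__main__":
    print(expected_dataset_size())
    print(matches_recording_format(make_test_wav(SAMPLE_RATE_HZ, SAMPLE_WIDTH_BYTES)))
```

A check of this kind could be run over every file before feature extraction, so that a recording made with different settings is caught early instead of silently distorting the frequency analysis.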