New Wavelet-Based Pitch Detection Method for Human-Robot Voice Interface

T.H. Tran, Q.P. Ha, and G. Dissanayake
ARC Centre of Excellence for Autonomous Systems (CAS), Faculty of Engineering, University of Technology, Sydney, PO Box 123 Broadway NSW 2007 AUSTRALIA
E-mail: {ttran, quangha, gdissa}@eng.uts.edu.au

Abstract— A speech-activated interface between humans and autonomous/semi-autonomous systems requires accurate voice detection and recognition. In this area, pitch and end-point detection are of vital importance. This paper presents a new method for pitch detection based on the phase of the continuous wavelet transform. The advantage of the proposed technique is that it serves not only as an accurate pitch detector but also as an efficient solution to the end-point detection problem. Experimental results are provided for the detection of pitch periods and end points in a neural-network-based, voice-enabled wheelchair system.

I. INTRODUCTION

The human-robot interface plays a very important role in the operation of autonomous/semi-autonomous systems that are to interact with people. These interactions must take place in a setting that is easy to participate in, interesting, and intuitive for ordinary users [1]. Verbal communication is the most natural means of interacting with machines. Human-robot voice communication covers many speech research areas such as speech recognition, speech synthesis, speech identification and verification [1-3]. Voice-enabled human-robot interfaces, although still in their infancy, have been applied successfully in tour-guide robots [1,4]. For semi-autonomous systems such as a voice-enabled wheelchair, on the other hand, the requirements on reliability and speaker identification become more important. To recognise a speaker's voice, it is essential to extract features that are invariant for that speaker yet remain unique enough to reject an impostor.
The periodicity of voiced speech, known as pitch, is considered a key feature that can be used to identify the speaker reliably [5]. The pitch period is thus an important parameter [3,5,6] in accurate voice detection and speaker identification. Estimating pitch periods in speech processing is difficult because pitch frequencies can vary from 60 Hz to 500 Hz, and the pitch period of the same person may vary with that person's emotional state, accent, and other perceptual variables [7,8].

Several methods are available for pitch period estimation [3,5-10]. Classical methods, based on the autocorrelation function, the average magnitude difference function, and the spectrum, are insensitive to non-stationary variations of the pitch period over the segment length and are hence unsuitable for low-pitched and high-pitched speakers [9]. Recently, methods based on the discrete wavelet transform have been developed and shown to be suitable for a wide range of speakers [6,8,10]. As noted in [11], however, these methods do not perform well in determining the pitch period under severe noise conditions, which is the case for a wheelchair user, whose utterances are quite often made against background noise. Voice control of systems such as a wheelchair therefore needs an accurate method for estimating the pitch period and for locating speech end points.

In this paper, a new detection method is proposed based on the phase of the continuous wavelet transform (CWT). First, the relationship between the CWT phase and the pitch phase is established. An effective algorithm for pitch detection is then developed, making use of the pitch period parameter. The algorithm is applied to detect the starting and ending points of monosyllable words having continuous speech waves. A neural network (NN) is trained to recognise a number of monosyllable-word voice commands via spectrogram parameters.
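As a point of reference for the classical methods mentioned above, a minimal autocorrelation-based pitch estimator can be sketched as follows. This is an illustrative baseline only, not the CWT-phase method proposed in this paper. The function name autocorr_pitch, the 8 kHz sampling rate, and the synthetic test frame are assumptions introduced for the example; the search is restricted to the 60 Hz to 500 Hz pitch range quoted earlier.

```python
import numpy as np


def autocorr_pitch(frame, fs, f_min=60.0, f_max=500.0):
    """Estimate the pitch frequency (Hz) of one voiced frame.

    Hypothetical helper for illustration: picks the lag with the largest
    autocorrelation inside the lag range that corresponds to f_min..f_max.
    """
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()  # remove any DC offset before correlating

    # Full autocorrelation; index len(x) - 1 is lag 0, so keep lags >= 0.
    r = np.correlate(x, x, mode="full")[len(x) - 1:]

    # Convert the admissible pitch range into a lag range (in samples).
    lag_min = int(fs / f_max)
    lag_max = min(int(fs / f_min), len(r) - 1)

    # One lag is selected for the whole frame.
    lag = lag_min + int(np.argmax(r[lag_min:lag_max + 1]))
    return fs / lag


# Synthetic "voiced" frame (assumed values): 125 Hz fundamental plus one
# harmonic, sampled at 8 kHz, 512 samples long.
fs = 8000
t = np.arange(512) / fs
frame = np.sin(2 * np.pi * 125 * t) + 0.5 * np.sin(2 * np.pi * 250 * t)

estimate = autocorr_pitch(frame, fs)
print(f"estimated pitch: {estimate:.1f} Hz")  # expected to be close to 125 Hz
```

Because a single lag is chosen for the whole frame, such an estimator cannot follow pitch variations within the segment, which is the limitation of the classical methods noted above.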
Features extracted by the CWT and the pitch period are used to train the neural network. The results are comparable to those obtained using features extracted by the short-time Fourier transform (STFT). The proposed pitch detection method, which performs reliably, is applied to the voice control of a wheelchair.

0-7803-8463-6/04/$20.00 ©2004 IEEE. Proceedings of 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems, September 28 - October 2, 2004, Sendai, Japan.

II. PITCH DETECTION USING THE CWT PHASE

In speech processing, the pitch period is an important parameter in many applications such as speech compression coding, analysis and synthesis, speech segmentation, and automatic monosyllable-word speech recognition. In a voice-controlled wheelchair, the pitch period is used in the end-point detector and as an extracted feature for NN training and voice recognition. The wavelet transform, developed as a branch of applied mathematics in the late 1980s, has become a