Accepted and presented at the National Conference on Emerging Global Trends in Engineering & Technology (EGTET), 7-8 March 2014, conducted by Assam Don Bosco University, Guwahati, India.

A New Approach to Real-Time Voice Activity Detection in Industrial Background Noise

Sivaranjan Goswami
Department of Electronics and Communication Engineering, Gauhati University
Guwahati, Assam, India 781014
sivgos@gmail.com

Abstract— Voice activity detection is a fundamental pre-processing task that precedes various speech processing operations. This work proposes a novel, computationally inexpensive approach to robust voice activity detection in a heavy industry or a running vehicle. In such environments, the background noise can be represented by a wide-sense stationary random process over a period longer than the stationarity period of speech. Hence the difference between the means, or the autocorrelations, of successive frames of equal duration slightly greater than the pitch period of human voice can be used as a feature for voice activity detection. For noise-mixed speech the difference is much greater than zero, whereas for silence it is ideally zero. Since the feature is a difference, the threshold value for decision making is independent of the noise power level. Experiments using synthetic mixtures of speech with various sources of background noise show an average error of 22%; however, qualitative analysis of speech recorded in real-world background noise exhibits better results.

Keywords— voice activity detection (VAD), random process, stationarity, auto-correlation function (ACF), mean

I. INTRODUCTION

Voice activity detection (VAD) is the first stage of various speech processing and speech communication tasks.
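The frame-difference feature outlined in the abstract can be sketched in a few lines. The following is a minimal illustration only, not the implementation evaluated in this paper: the function name, the 25 ms frame length, the number of autocorrelation lags, and the threshold value are all assumed here for demonstration, and the paper's actual thresholding scheme is described later in section III.

```python
import numpy as np

def frame_difference_vad(x, fs, frame_ms=25.0, lags=8, threshold=0.05):
    """Flag frames whose short-time autocorrelation differs markedly
    from that of the previous frame (illustrative sketch)."""
    n = int(fs * frame_ms / 1000)              # samples per frame
    nframes = len(x) // n
    frames = x[:nframes * n].reshape(nframes, n)

    # First `lags` biased autocorrelation estimates of each frame.
    acf = np.array([[np.dot(f[:n - k], f[k:]) / n for k in range(lags)]
                    for f in frames])

    # Feature: distance between autocorrelations of successive frames.
    # Stationary noise alone gives a near-zero difference; the onset of
    # speech changes the frame statistics and gives a large difference.
    d = np.linalg.norm(np.diff(acf, axis=0), axis=1)

    # First frame has no predecessor, so it is marked as non-speech.
    return np.concatenate(([False], d > threshold))
```

For example, on one second of stationary white noise the function flags almost nothing, while adding an amplitude-modulated tone burst (a crude stand-in for voiced speech) produces detections around the burst, since the frame-to-frame statistics change there.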
In a quiet background, voice activity can easily be detected by observing the short-time average power of the input speech signal; however, the presence of background noise makes it a challenging task. The required characteristics of an ideal voice activity detector are reliability, robustness, accuracy, adaptation, simplicity, real-time processing, and the absence of any need for prior knowledge of the noise [1]. Over the years, a number of VAD algorithms have evolved using various approaches such as full-band and subband energies, spectrum divergence measures between speech and background noise, pitch estimation, zero crossing rate, higher order statistics, etc. [2]. In most of these approaches, a feature is extracted from the input speech, sometimes after some pre-processing, and subjected to threshold conditions for the decision.

The present work proposes a new approach to voice activity detection in a heavy industry or a moving vehicle. It is based on the fact that the background noise in a factory is essentially the sound of machines, which can be represented by a wide-sense stationary random process. Human voice, on the other hand, cannot be represented by a stationary random process if the frame size is longer than the pitch period of speech. Hence, if there is no voice, the difference between the statistical properties of two consecutive frames will be very small, whereas for noise-mixed speech this difference will be much higher.

Section II contains a brief discussion of the stationarity of human speech and background noise. Section III describes the proposed algorithm in detail. Section IV covers the experiments performed, the experimental results, and the real-time implementation of the algorithm. Finally, sections V and VI contain the discussion and conclusion respectively.

II. SPEECH SIGNAL AND BACKGROUND NOISE

A. Stationarity of Speech Signal

Fig. 1. Source/system model for a speech signal.

Speech can be represented phonetically by a finite set of symbols called the phonemes of the language, the number of which depends upon the language and the refinement of the analysis [3]. Each phoneme corresponds to a distinctly different waveform. A common speech generation model in digital signal processing is shown in Fig. 1. Speech signals can be classified into two broad categories, namely, voiced and unvoiced sound. Voiced sound has high amplitude and low frequency (a quasi-periodic excitation), whereas unvoiced