Computer Engineering and Intelligent Systems www.iiste.org ISSN 2222-1719 (Paper) ISSN 2222-2863 (Online) Vol.13, No.3, 2022

Voice Activity Detection: Fusion of Time and Frequency Domain Features with an SVM Classifier

Sheriff Alimi*, Oludele Awodele
Department of Computer Science, Babcock University, Ilesan-Remo, Nigeria
*E-mail of the corresponding author: alimi0356@pg.babcok.edu.ng

Abstract
Voice activity detection (VAD) discriminates between segments of an audio signal that have speech content and those with either noise or silence. It is deployed as the front end of speech processing applications such as speech recognition and speaker recognition to improve their accuracy and efficiency. It is also used in communication systems to make efficient use of transmission bandwidth by ensuring that only segments of the audio signal with voice activity are encoded and transmitted. In this work, the VAD algorithm was implemented using a feature-fusion strategy. In the pre-processing stage, content outside the human auditory frequency range was removed with a digital Butterworth bandpass filter. The signal was then fragmented into frames, from which time-domain features (zero-crossing rate, standard deviation, normalized envelope, kurtosis, skewness, and root-mean-square energy) and frequency-domain features (13 MFCCs) were extracted and combined to form a feature representation of each frame. Recursive feature elimination was applied to the dataset to reduce the features to seven (7), which were used to train a Support Vector Machine (SVM) to distinguish between voiced and unvoiced frames. This simple SVM-based VAD system recorded state-of-the-art performance, with accuracy, recall, precision, and F1 score all at 100%, on par with similar implementations that use complex deep neural network architectures with high computational cost and training time.
Keywords: Voice activity detection, fusion strategy, support vector machine, frequency-domain features, time-domain features
DOI: 10.7176/CEIS/13-3-03
Publication date: May 31st, 2022

1. Introduction
In speech signals, some segments contain voice activity and utterances, while other sections have no such activity and are considered silent parts of the signal. In some speech processing applications, it is important to be able to distinguish between the silent and voiced sections. Most often, the whole speech signal is partitioned into much smaller units called frames, whose size is determined by the application type; in situations where transforming the discrete-time signal to the frequency domain is required, the size is chosen to optimize the performance of the discrete Fourier transform (DFT) operation. It then becomes imperative for such an application to be able to decipher, at the frame level, whether there is voice activity or not. In a typical speech conversation, the speaker talks for only 40% of the time; for the remaining 60% the speaker is idle, and this idle part, which lacks human utterances, is considered silence (Krishnakumar & Williamson, 2019; Bäckström, 2017). Voice activity detection (VAD) is primarily the analysis of audio and speech signals to determine the regions with an utterance (Lavechin et al., 2020; Bäckström, 2017); the VAD algorithm thus functions as a discriminator, identifying the speech parts of an audio signal and eventually discarding regions of silence. VAD has been used extensively as the front end of many speech processing applications, such as speaker recognition, speech recognition, speech enhancement, gender identification and age identification (Mohammed and Hassan, 2020). VAD has significantly helped back-end applications improve their accuracy and overall processing time (Dey et al., 2019), as silent segments are never passed to them for processing.
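The framing step described above can be sketched as follows. The frame length of 512 samples and hop of 256 are illustrative choices, not values prescribed by this paper; a power-of-two frame length is a common choice because it suits radix-2 FFT implementations of the DFT.

```python
import numpy as np

def frame_signal(x, frame_len=512, hop=256):
    """Split a 1-D signal into overlapping frames of length frame_len,
    advancing by hop samples between consecutive frames."""
    n_frames = 1 + (len(x) - frame_len) // hop
    # Build an index matrix so each row selects one frame from x.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

x = np.arange(2048, dtype=float)        # toy signal of 2048 samples
frames = frame_signal(x)
print(frames.shape)                     # (7, 512): 7 frames of 512 samples

# Frequency-domain view of each frame via the real FFT.
spectra = np.fft.rfft(frames, axis=1)
print(spectra.shape)                    # (7, 257): 512-point rfft gives 257 bins
```

Once the signal is in this frame matrix, per-frame features (time- or frequency-domain) can be computed row by row.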
In digital telephony, such as GSM technology, VoIP technology and other related communication systems, there is always contention for the bandwidth available to transfer information from one end of the communication link to the other. Encoding and transferring the silent part of a talk, which accounts for 60% of the conversation period, over a contested transmission medium is therefore a great waste and an inefficient utilization of scarce communication resources. Many speech processing applications, such as speaker recognition, speaker verification, automatic speech recognition, emotion recognition and gender detection, deal with classification problems; using the silent sections of the speech in both training and validating such systems will yield unsatisfactory accuracies. To address these problems, voice activity detection (VAD) is very useful in discriminating between sections with utterances and those without, discarding the silent ones so that the voiced ones are passed on for further processing by the back-end system. The resultant effect of introducing VAD is that it brings about efficient utilization of transmission bandwidth and improves the accuracy of speech processing applications.
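As a rough illustration of the pipeline described in the abstract, the sketch below bandpass-filters a signal, frames it, computes a subset of the time-domain features (MFCC extraction, the normalized envelope, and recursive feature elimination are omitted for brevity) and trains an SVM on toy data. The filter order, cutoff frequencies, frame length, and sampling rate here are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
from scipy.stats import kurtosis, skew
from scipy.signal import butter, sosfilt
from sklearn.svm import SVC

def time_domain_features(frame):
    """A subset of the per-frame time-domain features named in the paper."""
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2  # zero-crossing rate
    rms = np.sqrt(np.mean(frame ** 2))                  # root-mean-square energy
    return np.array([zcr, np.std(frame), kurtosis(frame), skew(frame), rms])

def extract(signal, fs=16000, frame_len=512):
    # Butterworth bandpass roughly covering the audible range (cutoffs assumed).
    sos = butter(4, [20, 7000], btype="bandpass", fs=fs, output="sos")
    filtered = sosfilt(sos, signal)
    n = len(filtered) // frame_len
    frames = filtered[: n * frame_len].reshape(n, frame_len)
    return np.vstack([time_domain_features(f) for f in frames])

# Toy data: "voiced" frames are a noisy sinusoid, "silence" is low-level noise.
rng = np.random.default_rng(0)
fs = 16000
t = np.arange(fs) / fs
voiced = np.sin(2 * np.pi * 220 * t) + 0.01 * rng.standard_normal(fs)
silence = 0.01 * rng.standard_normal(fs)

X = np.vstack([extract(voiced, fs), extract(silence, fs)])
y = np.array([1] * (len(X) // 2) + [0] * (len(X) // 2))  # 1 = voiced, 0 = silence
clf = SVC(kernel="rbf").fit(X, y)
print(clf.score(X, y))
```

On such cleanly separable toy data the training accuracy is essentially perfect; real recordings would require the full feature set, feature selection, and a held-out test split as done in the paper.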