(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 14, No. 11, 2023

Speech Enhancement using Fully Convolutional UNET and Gated Convolutional Neural Network

Danish Baloch 1, Sidrah Abdullah 2, Asma Qaiser 3, Saad Ahmed 4, Faiza Nasim 5, Mehreen Kanwal 6
Department of Computer Science, DHA Suffa University, Karachi, Pakistan 1
Department of Computer Science and Information Technology, NED University of Engineering & Technology, Karachi, Pakistan 2, 5
Department of Computer Science, IQRA University, Karachi, Pakistan 3, 4
MS FAST University, Department of Computer Science, Pakistan 6

Abstract: Speech enhancement aims to improve the quality and intelligibility of speech by reducing the background noise that degrades it. This paper presents a deep learning approach for suppressing background noise in a speaker's voice. Noise is a complex nonlinear function, so classical techniques such as spectral subtraction and Wiener filtering are poorly suited to removing non-stationary noise. The audio signal is processed as a raw waveform to provide an end-to-end speech enhancement approach. The proposed architecture is a 1-D fully convolutional encoder-decoder gated convolutional neural network (CNN). The model takes a simulated noisy signal and generates its clean representation. The model is optimized in both the time and spectral domains; an L1 loss minimizes the error between time-domain samples and between spectral magnitudes. Although trained exclusively on English speech, the generative model can also denoise Urdu speech when it is provided. Experimental results show that, when trained on samples of the Valentini dataset, the model generates a clean representation directly from a noisy signal.
Performance is evaluated using objective measures such as PESQ (Perceptual Evaluation of Speech Quality) and STOI (Short-Time Objective Intelligibility). The system can be used with recorded videos and as a preprocessor for voice assistants such as Alexa and Siri, sending clear and clean instructions to the device.

Keywords: Speech enhancement; speech denoising; deep neural network; raw waveform; fully convolutional neural network; gated linear unit

I. INTRODUCTION

Speech enhancement has been a topic of interest for five decades. It aims to improve speech quality by reducing background noise through various algorithms [1]; the purpose is to increase the intelligibility of a speech signal degraded by noise using audio signal processing techniques. The conventional methods for noise reduction are spectral subtraction and the Wiener filter [2], [3]. However, both approaches leave musical artifacts in the synthesized speech [4], require multiple sources for noise-profile information, and distort the desired output. Deep learning approaches can overcome these pitfalls because such systems can learn mappings between complex nonlinear functions [5]. In addition, they can produce outputs that decrease the word error rate (WER) of automatic speech recognition (ASR) systems [6], boost the performance of speech-to-text systems [7], and, in general, increase speech intelligibility, which benefits any system whose performance depends on it. In deep learning, the classical approach to suppressing noise in a signal is mask-based denoising [8], in which DNN models produce a time-frequency (TF) mask that filters out the noise and leaves the speech.
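The mask-based pipeline described above can be sketched in a few lines. This is an illustrative sketch only, not the paper's method: instead of a DNN-predicted mask, it uses an oracle ratio mask computed from the known clean signal, and it reconstructs the waveform with the noisy phase, which is exactly the phase-dependence limitation discussed next.

```python
import numpy as np
from scipy.signal import stft, istft

def mask_denoise(noisy, clean, fs=16000, nperseg=512):
    """Illustrative TF-mask denoising. A real system predicts the mask
    with a DNN; here an oracle magnitude-ratio mask is computed from the
    known clean signal purely for demonstration."""
    _, _, N = stft(noisy, fs=fs, nperseg=nperseg)   # complex noisy STFT
    _, _, C = stft(clean, fs=fs, nperseg=nperseg)   # complex clean STFT
    eps = 1e-8
    mask = np.clip(np.abs(C) / (np.abs(N) + eps), 0.0, 1.0)
    # Apply the mask to the noisy magnitude but keep the NOISY phase --
    # reconstruction quality therefore depends on that phase.
    enhanced_spec = mask * np.abs(N) * np.exp(1j * np.angle(N))
    _, enhanced = istft(enhanced_spec, fs=fs, nperseg=nperseg)
    return enhanced[: len(noisy)]
```

Even with an oracle mask, the enhanced waveform inherits the noisy signal's phase, which motivates the time-domain approach taken in this paper.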
Mask-based approaches mostly operate on magnitude spectrograms of the audio [9], [10]; this creates the challenge of reconstructing the time-domain audio after it has been filtered with the predicted spectrogram mask, and that reconstruction depends heavily on the phase of the noisy input audio. Another investigated direction is the mapping-based approach, in which a representation of a complex nonlinear noisy signal is mapped directly onto a clean signal [11], [12], [13]. Because amplitudes in raw audio waveforms vary rapidly, mapping-based approaches are usually based on the short-time Fourier transform (STFT) of the audio.

A. Proposed Approach

Our proposed approach is a mapping-based approach on the raw audio waveform (time domain). The loss function is optimized over both the time domain and the STFT of the audio. This eliminates the need to reconstruct audible raw audio from a spectrogram output, as in [11] and [12]; instead, the model generates the enhanced speech output directly. The magnitude spectrogram is incorporated inside the loss function rather than given as input to the model, as in [9] and [13], which lets us perform speech enhancement directly on the raw waveform. Given an audio signal, our system generates its clean representation without any additional post-processing of the model output. The proposed approach focuses on enhancing speech and suppressing noise in audio sampled at 22.05 kHz. To achieve this, a U-Net architecture is used, chosen because it takes raw-waveform audio as input without any manual feature extraction and also produces raw-waveform output, which can be converted to an mp3 file and saved to disk directly. It consists of convolutional layers and a middle layer which is a
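The combined time-domain and STFT-magnitude L1 objective described in the Proposed Approach can be sketched as follows. The equal weighting `alpha=0.5` and the STFT parameters are assumptions for illustration; the paper does not fix them at this point.

```python
import numpy as np
from scipy.signal import stft

def combined_l1_loss(enhanced, clean, fs=22050, nperseg=512, alpha=0.5):
    """Sketch of an L1 loss over both domains: mean absolute error
    between raw-waveform samples plus mean absolute error between
    STFT magnitudes. The weighting `alpha` is an assumed value."""
    # Time-domain term: L1 between raw waveform samples.
    time_l1 = np.mean(np.abs(enhanced - clean))
    # Spectral term: L1 between magnitude spectrograms (phase is ignored,
    # so the magnitude spectrogram appears only inside the loss).
    _, _, E = stft(enhanced, fs=fs, nperseg=nperseg)
    _, _, C = stft(clean, fs=fs, nperseg=nperseg)
    spec_l1 = np.mean(np.abs(np.abs(E) - np.abs(C)))
    return alpha * time_l1 + (1 - alpha) * spec_l1
```

In training, the same idea would be expressed with a differentiable STFT inside the deep learning framework so gradients flow through both terms.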