Speech Refinement Using Custom Filter for Developing Robust S2S Dataset

Olaniyan Julius — Department of Computer Science, College of Pure and Applied Sciences, Landmark University, Omu-Aran, Kwara State, Nigeria. olaniyan.julius@lmu.edu.ng
Ibidun C. Obagbuwa — Department of Computer Science & Information Technology, School of Natural and Applied Sciences, Sol Plaatje University, Kimberley, South Africa. ibidun.obagbuwa@spu.ac.za
Ayodele A. Adebiyi — Department of Computer Science, College of Pure and Applied Sciences, Landmark University, Omu-Aran, Kwara State, Nigeria. ayo.adebiyi@lmu.edu.ng
Esiefarienrhe B. Michael — Department of Computer Science, North-West University, Mafikeng, South Africa. 25840525@nwu.ac.za

Abstract - Neural network-based speech-to-speech (S2S) translators require a robust, well-refined dataset of audio signals from which to produce automatic translations. During recording, these signals are usually contaminated by noise that alters the information conveyed by the original clean signal, which in turn degrades the accuracy of the translation system. Although many de-noising techniques have been proposed for removing noise from raw audio signals, this paper presents a novel approach based on a custom filter built on the Short-Time Fourier Transform (STFT). Using the LJ Speech dataset, publicly available on Kaggle, experimental results show that the proposed technique improves the signal-to-noise ratio (SNR) by 1.001 dB on average, making it an effective method for eliminating noise from speech and other useful acoustic signals.

Index Terms - Automatic Translations, Audio Signals, Neural Network, Short-Time Fourier Transform, Signal-to-noise ratio.

I. INTRODUCTION

In general, speech refinement can be used to improve the quality of speech-processing equipment, such as digital hearing aids, cell phones, and other man-machine interfaces in daily life, making them more robust in noisy ambient conditions [1]. Speech is one of the most fundamental ways people communicate with one another and express emotion; it is the primary means of message transmission for humans. Human speech can be modeled as a filter acting on an excitation waveform, and speech can be classified as either voiced or unvoiced [2]. In the spectrum of voiced speech, energy is concentrated at specific frequencies, namely the fundamental frequency of the vocal folds and its multiples (harmonics). A narrow constriction of the vocal tract, with air moving quickly through it, produces a random excitation resembling white noise; only around one-third of speech is entirely periodic [3]. Situations frequently arise in which the voice signal must be measured and then transformed into another form, even though a bandwidth of only 4 kHz suffices for communication [4]. For telephone conversations, an analog-to-digital converter samples the electrical voice signal at 8000 samples per second, enabling digital transmission and speech-signal processing [5]. Background noise is one of the most frequent noise sources, since it is present no matter where one is. Other types of noise include quantization noise, caused by over-compressing speech signals [6], multi-talker babble, channel noise, which affects both analog and digital transmission, and noise that is delayed or reverberated (echo). Even though denoising has long been a focus of research, there is always room for advancement.
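The paper's custom STFT filter is developed in later sections; purely as an illustration of the general idea of STFT-domain noise removal, the following sketch applies simple spectral gating, zeroing time-frequency bins whose magnitude falls below a threshold. The SciPy-based implementation, the function name, and the threshold value are our assumptions, not the authors' method.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_gate(noisy, fs=8000, nperseg=256, threshold=0.05):
    """Naive STFT-domain denoiser: zero out low-magnitude bins.

    Generic spectral-gating sketch, NOT the paper's custom filter;
    the 0.05 relative threshold is an arbitrary assumption.
    """
    _, _, Z = stft(noisy, fs=fs, nperseg=nperseg)
    mask = np.abs(Z) >= threshold * np.abs(Z).max()  # keep strong bins only
    _, cleaned = istft(Z * mask, fs=fs, nperseg=nperseg)
    return cleaned[: len(noisy)]

# Synthetic example: a 440 Hz tone in white noise, sampled at the
# 8 kHz telephone rate mentioned above.
fs = 8000
tgrid = np.arange(fs) / fs
rng = np.random.default_rng(0)
noisy = np.sin(2 * np.pi * 440 * tgrid) + 0.2 * rng.standard_normal(fs)
denoised = spectral_gate(noisy, fs=fs)
```

Because the tone's energy is concentrated in a few strong STFT bins while white noise is spread across all of them, the gate attenuates mostly noise; a learned or hand-designed custom filter such as the paper's would replace the hard threshold with a more refined mask.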
Three primary goals are generally associated with speech enhancement: 1) to make the processed speech sound better, lessening listener fatigue by improving perceptual factors such as quality and intelligibility; 2) to increase the robustness of speech coders, which are frequently degraded by noise; and 3) to improve the accuracy of voice-recognition systems used in loud settings [7]. Specifically, speech-to-speech (S2S) translators are software systems that automatically translate one spoken language into another [8], [9]. S2S systems are implemented in various ways, but the most novel and trending approach is the neural-network-based speech translation system [10]. Neural networks, or artificial neural networks (neural nets for short), are computing paradigms purposely designed to process information the way the human brain does, using neurons [11]. S2S translation systems that use deep learning algorithms (artificial neural networks) to translate speech in one language into speech in another are called neural network-based S2S translators. These systems

2023 International Conference on Science, Engineering and Business for Sustainable Development Goals (SEB-SDG) | 979-8-3503-2478-5/23/$31.00 ©2023 IEEE | DOI: 10.1109/SEB-SDG57117.2023.10124474
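The 1.001 dB average improvement reported in the abstract refers to the standard signal-to-noise ratio in decibels. As a minimal sketch of how such a figure can be computed (the function name and synthetic example are ours, not taken from the paper):

```python
import numpy as np

def snr_db(clean, degraded):
    """Signal-to-noise ratio in decibels: 10*log10(P_signal / P_noise),
    where the noise is the residual between degraded and clean signals."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(degraded, dtype=float) - clean
    return 10.0 * np.log10(np.sum(np.square(clean)) / np.sum(np.square(noise)))

# A residual with 1e-4 of the signal's power gives exactly 40 dB.
clean = np.sin(np.linspace(0.0, 2.0 * np.pi, 1000))
noisy = clean + 0.01 * clean  # noise power = 1e-4 * signal power
print(round(snr_db(clean, noisy), 1))  # 40.0
```

In this framework, a de-noising method's gain is simply `snr_db(clean, denoised) - snr_db(clean, noisy)`, which is the quantity the paper averages over the dataset.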