Speech Refinement Using Custom Filter for Developing Robust S2S Dataset

Olaniyan Julius
Department of Computer Science, College of Pure and Applied Sciences, Landmark University, Omu-Aran, Kwara State, Nigeria.
olaniyan.julius@lmu.edu.ng

Ibidun C. Obagbuwa
Department of Computer Science & Information Technology, School of Natural and Applied Sciences, Sol Plaatje University, Kimberley, South Africa.
ibidun.obagbuwa@spu.ac.za

Ayodele A. Adebiyi
Department of Computer Science, College of Pure and Applied Sciences, Landmark University, Omu-Aran, Kwara State, Nigeria.
ayo.adebiyi@lmu.edu.ng

Esiefarienrhe B. Michael
Department of Computer Science, North-West University, Mafikeng, South Africa.
25840525@nwu.ac.za
Abstract - Neural network-based speech-to-speech (S2S) translators require a robust, well-refined dataset of audio signals from which to make automatic translations. During recording, these signals are usually accompanied by noise, which alters the information conveyed by the original noiseless signal and thus degrades the accuracy of the translation system. Although researchers have proposed many de-noising techniques for removing noise from raw audio signals, this paper presents a novel approach to noise removal using a custom filter based on the Short-Time Fourier Transform (STFT). Using the LJ Speech dataset, publicly available on Kaggle, experimental results show that the proposed technique boosts the signal-to-noise ratio (SNR) by 1.001 dB on average, making it an effective method for eliminating noise from speech and other useful acoustic signals.
Index Terms - Automatic Translations, Audio Signals, Neural Network, Short-Time Fourier Transform, Signal-to-Noise Ratio.
I. INTRODUCTION
In general, speech refinement can be used to improve the quality of speech processing equipment, such as digital hearing aids, cell phones, and other man-machine interfaces in our daily lives, making them more robust in noisy environments [1]. Speech is one of the most fundamental ways that people communicate with one another and express emotion; it is the primary means of message transmission for humans. Human speech can be modeled as a filter acting on an excitation waveform, and speech can be classified as either voiced or unvoiced [2]. In the spectrum of voiced speech, energy is concentrated at specific frequencies, namely the fundamental frequency of the vocal folds and its multiples (harmonics). In unvoiced speech, a small constriction of the vocal tract, with air moving quickly through it, produces a random excitation resembling white noise; only around one-third of speech is entirely periodic [3].
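The source-filter view described above can be sketched numerically: a periodic impulse train stands in for the voiced glottal excitation, and a resonant all-pole filter stands in for the vocal tract. The 100 Hz fundamental, the single 700 Hz resonance, and all numeric values below are illustrative assumptions, not parameters taken from this paper.

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000          # sampling rate (telephone quality)
f0 = 100           # assumed fundamental frequency of the vocal folds
n = fs             # one second of samples

# Voiced excitation: an impulse train at f0 (unvoiced speech would
# instead use white noise as the excitation).
excitation = np.zeros(n)
excitation[:: fs // f0] = 1.0

# Hypothetical vocal-tract filter: one resonance ("formant") near 700 Hz,
# modeled as a second-order all-pole (autoregressive) section.
r, theta = 0.97, 2 * np.pi * 700 / fs
a = [1.0, -2.0 * r * np.cos(theta), r ** 2]
voiced = lfilter([1.0], a, excitation)
```

Because the excitation is a harmonic comb at multiples of 100 Hz and the filter peaks near 700 Hz, the strongest harmonic in `voiced` lands at the 700 Hz harmonic, mimicking the energy concentration described in the text.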
Although speech occupies a bandwidth of only about 4 kHz, circumstances frequently arise in which the voice signal must be measured and transformed into another form to support communication [4]. For telephone conversations, an analog-to-digital converter samples the electrical voice signal at 8000 samples per second, enabling digital transmission and speech signal processing [5]. Background noise is one of the most frequent noise sources, since it is present no matter where recording takes place. Other types of noise include quantization noise [6], caused by over-compressing speech signals; multi-talker babble; channel noise, which affects both analog and digital transmission; and echo, which can be either delayed or reverberated.
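The SNR figure reported in the abstract quantifies how strongly such noise corrupts a signal. A minimal sketch of the standard SNR-in-decibels computation, using a synthetic tone at the 8 kHz telephone rate mentioned above (the tone frequency and noise level are assumptions for illustration):

```python
import numpy as np

def snr_db(clean, noisy):
    """Signal-to-noise ratio in dB of `noisy` against the clean reference."""
    noise = noisy - clean
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

# One second of a synthetic 200 Hz tone sampled at the 8 kHz telephone rate,
# corrupted by additive white Gaussian noise.
fs = 8000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 200 * t)
noisy = clean + 0.1 * np.random.default_rng(0).normal(size=fs)
print(f"SNR: {snr_db(clean, noisy):.2f} dB")
```

A de-noising method is judged effective when it raises this value; the paper's reported gain of 1.001 dB is an average improvement in exactly this quantity.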
Even though denoising has long been a focus of research, there is always room for advancement. Speech enhancement is generally associated with three primary goals: 1) to make the processed speech sound better, lessening listener fatigue by improving perceptual factors such as quality and intelligibility; 2) to increase the robustness of speech coders, which are frequently degraded by noise; and 3) to improve the accuracy of voice recognition systems used in noisy settings [7].
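As one illustration of the STFT-based family of techniques this paper builds on (a generic spectral gate, not the paper's custom filter), noisy speech can be analyzed frame by frame, a noise floor estimated per frequency bin, and low-magnitude time-frequency cells suppressed. The frame count, window length, and threshold factor below are assumptions:

```python
import numpy as np
from scipy.signal import stft, istft

def stft_denoise(x, fs, noise_frames=10, factor=2.0, nperseg=256):
    """Suppress time-frequency cells below a multiple of the noise floor."""
    # Short-time Fourier transform: complex spectrum per frame.
    f, t, Z = stft(x, fs=fs, nperseg=nperseg)
    # Estimate the per-bin noise floor from the first few frames,
    # assumed here to contain noise only.
    noise_mag = np.mean(np.abs(Z[:, :noise_frames]), axis=1, keepdims=True)
    # Binary gate: keep only cells well above the noise floor.
    mask = np.abs(Z) > factor * noise_mag
    # Invert the masked spectrogram back to a waveform.
    _, x_hat = istft(Z * mask, fs=fs, nperseg=nperseg)
    return x_hat
```

The hard binary mask is the simplest choice; soft masks or spectral subtraction weight each cell continuously instead of zeroing it, trading residual noise against musical-noise artifacts.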
Specifically, speech-to-speech (S2S) translators are software systems that automatically translate one spoken language into another spoken language [8], [9]. S2S systems are implemented in various ways, but the most novel and trending approach is the neural network-based speech translation system [10]. Neural networks, or artificial neural networks (neural nets for short), are computing paradigms purposely designed to process information the way the human brain does, using neurons [11]. S2S translation systems that use deep learning algorithms (artificial neural networks) to translate speech in one language into speech in another are called neural network-based S2S translators. These systems
2023 International Conference on Science, Engineering and Business for Sustainable Development Goals (SEB-SDG) | 979-8-3503-2478-5/23/$31.00 ©2023 IEEE | DOI: 10.1109/SEB-SDG57117.2023.10124474