International Journal of Speech Technology
https://doi.org/10.1007/s10772-019-09657-y
Bark scaled oversampled WPT based speech recognition enhancement
in noisy environments
Navneet Upadhyay¹,² · Hamurabi Gamboa Rosales¹
Received: 27 August 2019 / Accepted: 11 November 2019
© Springer Science+Business Media, LLC, part of Springer Nature 2019
Abstract
The performance of a speech recognition system degrades significantly in real-world environments owing to the acoustic
mismatch between training and operating conditions. This paper presents a two-stage approach to make a speech recognition
system robust to additive, uncorrelated background noise. In the first stage, an oversampled wavelet packet transform (WPT)
decomposes the input noisy speech into seventeen nonlinear frequency subbands that approximate the Bark scale of the
human auditory system, and spectral subtraction with adaptive noise estimation filters each subband signal. The oversampled
WPT is linear and overcomes the shift-variance problem of the critically sampled transform by omitting the decimation after
the filtering at each decomposition level. In the second stage, a nonparametric approach extracts features from the filtered
speech, and these parameters are compared with those extracted from speech signals stored in a template to recognize the
utterance. A series of experiments evaluates the performance of the proposed two-stage system in a variety of real
environments, with and without the first stage. Recognition accuracy is evaluated at the word level over a wide range of
SNRs for various types of noisy environments. The experimental results show a significant improvement in recognition
performance at low SNR with the proposed system.
Keywords Speech enhancement · Oversampled WPT · Bark and Mel frequency scale · Hidden Markov model · Speech
recognition
1 Introduction
Speech is the primary form of communication among
humans. Over the years, automatic speech recognition
(ASR) has emerged as a key speech technology, providing
easy human-to-machine communication. Speech recognition
(or speech-to-text) is the ability of a machine to recognize
naturally flowing human speech, such as words, phonemes,
and sentences, from a wide variety of users. An ASR system
can be divided into two main components: a front-end (or
feature extractor) and a back-end (or recognizer). The feature
extractor obtains a compact representation of a speech signal,
compressing the relevant information into a small number of
coefficients. The back-end module recognizes the input signal
using the features extracted by the front-end
(Cutajar et al. 2013; Benzeghiba et al. 2007).
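To make the front-end/back-end split concrete, the following is a compact numpy sketch of a typical front-end: frame the signal, take magnitude spectra, and pool them with a triangular mel filterbank into a small number of coefficients per frame. The filter count, frame length, and sampling rate here are illustrative assumptions, not the paper's exact feature extractor.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular filters with centers evenly spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fbank

def log_mel_features(signal, fs=8000, frame_len=256, n_filters=17):
    """Log mel-filterbank energies: one n_filters-dim vector per frame."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    mag = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))
    fbank = mel_filterbank(n_filters, frame_len, fs)
    return np.log(fbank @ mag.T + 1e-10).T  # shape (n_frames, n_filters)

# Example: features for one second of a synthetic signal
feats = log_mel_features(np.random.default_rng(1).standard_normal(8000))
```

A back-end (e.g. an HMM recognizer) would then score sequences of such feature vectors against trained word or phoneme models.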
Conventional feature-based speech recognition systems
perform well in clean environments, but their performance
degrades dramatically when differences exist between the
training and test conditions. These differences, known as
mismatched conditions, are caused by degradation of the
speech signal by acoustic background noise, reverberation,
etc. Many studies address the robustness of speech
recognition systems under mismatched conditions, yet ASR
systems remain below the level of human speech recognition
capability. Methods to compensate for the effects of
environmental mismatch can be implemented at the
front-end, at the back-end, or at both (Gong 1995;
Juang 1991; Acero and Stern 1990).
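As a concrete example of a front-end compensation method, here is a minimal numpy sketch of frame-wise magnitude spectral subtraction applied to a single subband signal. The subband decomposition, overlap-add reconstruction, and the paper's adaptive noise estimator are omitted; the over-subtraction factor, spectral floor, and noise-frame count are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def spectral_subtract(subband, frame_len=256, noise_frames=5, alpha=2.0, beta=0.01):
    """Frame-wise magnitude spectral subtraction on one subband signal.

    Noise magnitude is estimated from the first `noise_frames` frames
    (assumed speech-free); `alpha` over-subtracts and `beta` sets a
    spectral floor to limit musical noise.
    """
    n_frames = len(subband) // frame_len
    frames = subband[:n_frames * frame_len].reshape(n_frames, frame_len)
    spectra = np.fft.rfft(frames * np.hanning(frame_len), axis=1)
    mag, phase = np.abs(spectra), np.angle(spectra)
    noise_mag = mag[:noise_frames].mean(axis=0)                   # noise estimate
    clean_mag = np.maximum(mag - alpha * noise_mag, beta * mag)   # subtract + floor
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), axis=1)  # keep noisy phase
    return clean.reshape(-1)

# Example: a noisy tone preceded by a noise-only segment
rng = np.random.default_rng(0)
fs = 8000
noise_only = 0.3 * rng.standard_normal(5 * 256)
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 440 * t) + 0.3 * rng.standard_normal(fs)
noisy = np.concatenate([noise_only, speech])
enhanced = spectral_subtract(noisy)
```

For brevity the sketch reconstructs windowed frames directly, without the overlap-add that a practical implementation would use.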
* Navneet Upadhyay
navneetbitsp@gmail.com
1 Department of Signal Processing and Acoustics, Faculty of Electrical Engineering, Autonomous University of Zacatecas, 98000 Zacatecas, Mexico
2 Department of Electronics and Communication Engineering, The LNM Institute of Information Technology, Jaipur 302 031, India