International Journal of Speech Technology
https://doi.org/10.1007/s10772-019-09657-y

Bark scaled oversampled WPT based speech recognition enhancement in noisy environments

Navneet Upadhyay 1,2 · Hamurabi Gamboa Rosales 1

Received: 27 August 2019 / Accepted: 11 November 2019
© Springer Science+Business Media, LLC, part of Springer Nature 2019

Abstract
The performance of a speech recognition system degrades significantly in real-world environments as a consequence of the acoustic mismatch between training and operating conditions. This paper presents a two-stage approach that makes a speech recognition system robust to additive, uncorrelated background noise. In the first stage, an oversampled wavelet packet transform (WPT) decomposes the entire input noisy speech into seventeen nonlinear frequency subbands that approximate the Bark scale of the human hearing system, and spectral subtraction with adaptive noise estimation filters the noisy speech in each subband. The oversampled WPT is linear and advantageous because it overcomes the shift-variance problem by removing the decimation after the filtering at each decomposition level. In the second stage, a nonparametric approach extracts features from the filtered speech, and the resulting parameters are compared with parameters extracted from template speech signals to recognize the utterance. A series of experiments is carried out to evaluate the performance of the proposed two-stage system in a variety of real environments, with and without the use of the first stage. Recognition accuracy is evaluated at the word level over a wide range of SNRs for various types of noisy environments. The experimental results show a significant improvement in recognition performance at low SNR using the proposed system.
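To make the enhancement stage concrete, the following is a minimal, single-band sketch of magnitude spectral subtraction. The paper applies adaptive noise estimation inside each Bark-scaled WPT subband; here, as a simplifying assumption, the noise spectrum is estimated from a fixed number of leading frames that are taken to be speech-free, and the subband decomposition is omitted. All function and parameter names are illustrative, not from the paper.

```python
import numpy as np

def spectral_subtraction(noisy, frame_len=256, hop=128, noise_frames=6, beta=0.002):
    """Basic magnitude spectral subtraction on a single band.

    Noise magnitude is estimated as the average spectrum of the first
    `noise_frames` frames (assumed to contain no speech); a small
    spectral floor `beta * noise_mag` limits musical noise.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(noisy) - frame_len) // hop
    # Analysis: windowed frames and their short-time spectra
    frames = np.stack([noisy[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    # Noise magnitude estimate from the leading (speech-free) frames
    noise_mag = mag[:noise_frames].mean(axis=0)
    # Subtract the estimate, keeping a small spectral floor
    clean_mag = np.maximum(mag - noise_mag, beta * noise_mag)
    # Resynthesize with the noisy phase (standard practice)
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    # Overlap-add; the constant window gain is ignored in this sketch
    out = np.zeros((n_frames - 1) * hop + frame_len)
    for i in range(n_frames):
        out[i * hop : i * hop + frame_len] += clean[i] * window
    return out
```

In the full system this filter would run once per subband of the oversampled WPT, with the noise estimate updated adaptively rather than frozen after the leading frames.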
Keywords Speech enhancement · Oversampled WPT · Bark and Mel frequency scale · Hidden Markov model · Speech recognition

1 Introduction

Speech is the primary means by which humans communicate and exchange information. Over the years, automatic speech recognition (ASR) has emerged as a key aspect of speech technology, providing easy accessibility for human-to-machine communication. Speech recognition (or speech-to-text) is the ability of a machine to recognize naturally flowing human speech, such as words or phonemes and sentences, from a wide variety of users. ASR systems can be categorized into two main components: a front-end (or feature extractor) and a back-end (or recognizer). The feature extractor obtains a compact representation of a speech signal that compresses the relevant information into a small number of coefficients. The back-end module recognizes the input signal using the features extracted by the front-end (Cutajar et al. 2013; Benzeghiba et al. 2007).

Conventional feature-based speech recognition systems perform well in a clean environment, but their performance degrades dramatically when differences exist between the environments of the training and test data. These differences, known as mismatched conditions, are due to the degradation of speech signals by acoustic background noise, reverberation, etc. Many studies address the robustness of speech recognition systems under mismatched conditions, yet ASR systems remain below the level of human speech recognition capability. Methods that compensate for the effects of environmental mismatch can be implemented at the front-end, at the back-end, or at both (Gong 1995; Juang 1991; Acero and Stern 1990).
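Since the keywords refer to the Bark and Mel frequency scales, the standard analytic approximations may be useful for orientation. The sketch below uses Zwicker's critical-band formula for the Bark scale and the common O'Shaughnessy mel formula, plus the tabulated critical-band edges whose first seventeen bands (roughly 0–3.7 kHz) match the seventeen-subband decomposition mentioned in the abstract; whether the paper uses exactly these edges is an assumption.

```python
import numpy as np

def hz_to_bark(f):
    """Zwicker's approximation of the critical-band (Bark) scale."""
    f = np.asarray(f, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def hz_to_mel(f):
    """Common mel-scale formula: 2595 * log10(1 + f/700)."""
    f = np.asarray(f, dtype=float)
    return 2595.0 * np.log10(1.0 + f / 700.0)

# Tabulated edges of the first seventeen critical bands (Hz).
# Seventeen bands cover roughly 0-3.7 kHz, i.e. the band of
# 8 kHz-sampled speech -- one plausible reading of the paper's
# seventeen-subband Bark-like decomposition.
BARK_EDGES_HZ = np.array([0, 100, 200, 300, 400, 510, 630, 770, 920,
                          1080, 1270, 1480, 1720, 2000, 2320, 2700,
                          3150, 3700])
```

By construction, 1000 Hz maps to roughly 1000 mel and about 8.5 Bark, which is a convenient sanity check for both formulas.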
* Navneet Upadhyay
  navneetbitsp@gmail.com

1 Department of Signal Processing and Acoustics, Faculty of Electrical Engineering, Autonomous University of Zacatecas, 98000 Zacatecas, Mexico

2 Department of Electronics and Communication Engineering, The LNM Institute of Information Technology, Jaipur 302 031, India