Auditory driven subband speech enhancement for automatic recognition of noisy speech

Navneet Upadhyay¹ • Hamurabi Gamboa Rosales¹

Received: 3 August 2016 / Accepted: 18 September 2016
© Springer Science+Business Media New York 2016

Abstract Speech recognizers achieve high recognition accuracy in quiet acoustic environments, but their performance degrades drastically when they are deployed in real environments, where the speech is degraded by additive ambient noise. This paper advocates a two-phase approach for robust speech recognition in such environments. First, a front-end subband speech enhancement stage with adaptive noise estimation (ANE) is used to filter the noisy speech. The whole noisy speech spectrum is partitioned into eighteen non-uniform subbands based on the Bark scale, and the noise power in each subband is estimated by the ANE approach, which does not require speech pause detection. Second, the filtered speech spectrum is processed by a non-parametric frequency-domain algorithm based on human perception, while the back end builds a robust classifier to recognize the utterance. A suite of experiments is conducted to evaluate the performance of the speech recognizer in a variety of real environments, with and without the front-end speech enhancement stage. Recognition accuracy is evaluated at the word level and over a wide range of signal-to-noise ratios for real-world noises. Experimental evaluations show that the proposed algorithm attains good recognition performance when the signal-to-noise ratio is lower than 5 dB.

Keywords Speech enhancement · Subband approach · Feature extraction · Mel scale · Feature matching · Speech recognition

1 Introduction

Speech is one of the most natural modes of communication between humans.
In recent years, automatic speech recognition (ASR) has emerged as an important aspect of speech technology and has been applied in many real-time applications, such as cellular phones, computers, and security systems. ASR transforms an acoustic speech signal, captured by a microphone, into text, usually a sequence of words. ASR gives accurate results in quiet or ideal acoustic environments and for carefully pronounced speech. Real environments differ from the ideal environment, causing a mismatch between the training set (a model trained on clean speech) and the operating set (noisy speech data used to verify quality), and consequently degrading the performance of ASR systems (Juang 1991; Cutajar et al. 2013; Gong 1995; Acero and Stern 1990). Therefore, recognizing speech, whether by humans or by machines, in real environments remains a challenging task. A number of approaches in the literature attempt to provide noise-immune or robust ASR systems under mismatched conditions (Cutajar et al. 2013; Gong 1995; Acero and Stern 1990).

Since clean speech is not always available in real environments, a realistic approach is to train the recognition system on clean speech and use speech enhancement to clean the noisy speech prior to recognition. A number of speech enhancement techniques have been used in the past to make speech recognition robust to noise (Juang 1991; Cutajar et al. 2013; Gong 1995; Acero and Stern 1990). Spectral subtraction (SpecSub) is one of the promising methods for enhancing noisy speech (Boll 1979). SpecSub works on the assumption that the noise contained in the noisy speech signal is additive and

✉ Navneet Upadhyay, navneetbitsp@gmail.com
¹ Department of Signal Processing & Acoustics, Faculty of Electrical Engineering, Autonomous University of Zacatecas, 98000 Zacatecas, Mexico

Int J Speech Technol, DOI 10.1007/s10772-016-9370-4