SHORT COMMUNICATION

Probabilistic model of speech with high spectral resolution using maximum-likelihood estimation

Mohammed Usman 1 • Mohammed Zubair M. Shamim 1

Received: 6 October 2018 / Revised: 5 September 2019 / Accepted: 25 October 2019
© The National Academy of Sciences, India 2019

Abstract A probabilistic model for the distribution of narrowband short-time Fourier transform (STFT) coefficients, with a frequency resolution of 5 Hz and below, is proposed for speech signals. An important application is modelling speech in a way that reflects the perceptual stability of human listening, which allows humans to perceive speech reliably under a wide range of acoustic conditions. Representation of speech with high spectral resolution finds applications in the design of digital hearing aids and cochlear implants for people with hearing disability. While speech is generally considered a non-stationary signal over segment lengths longer than 20–30 ms, the perceptual stability of the human auditory system motivates the need for an invariant representation of speech over long segment lengths. Computer modelling shows that STFT coefficients of speech with high frequency resolution fit a Laplace distribution (LD) reasonably accurately. Parameters of the corresponding LD are estimated using maximum-likelihood estimation. The Cramér–Rao bound for the estimated parameters and the root mean square error for the fitted distribution confirm the validity of the fitted distribution.

Keywords Probabilistic modelling · Speech modelling · STFT · ML estimation

Introduction

The performance of many speech processing algorithms depends on modelling speech signals using appropriate probability distributions.
Various distributions such as the gamma distribution (γ-D), Gaussian distribution (GD), generalized Gaussian distribution (GGD), Laplace distribution (LD) as well as multivariate Gaussian and Laplace distributions have been proposed in the literature to model different segment lengths of speech, typically below 200 ms, in different domains [1, 2]. The probability distribution model that best fits the speech samples depends on factors such as the domain of speech representation, segment length, silence periods in the speech, and noise. In this letter, we model the distribution of short-time Fourier transform (STFT) coefficients of speech with segment lengths longer than 500 ms. Modelling speech with high spectral resolution is useful in the study of the human auditory system, which shows perceptual stability, and in the design of hearing aids and cochlear implants for people with hearing disability, who are unable to distinguish subtle differences between spectral components. Conventional methods of speech representation emphasize spectro-temporal details which are not relevant for the intelligibility of speech. Variability of representation can be reduced by eliminating such details, resulting in a more stable model [3]. High spectral resolution modelling of speech also has applications in developing speech restoration algorithms, akin to the human brain, which is able to restore missing or distorted parts of speech and improve its overall intelligibility [4]. Audio cards introduce a d.c. offset component into recordings [5], which can bias the estimated parameters of the fitted distribution. In order to avoid this, a first-order d.c. removal infinite impulse response (IIR) filter is used to remove the d.c.

Mohammed Usman: omfarooq@kku.edu.sa; musman@ieee.org
Mohammed Zubair M. Shamim: mzmohammad@kku.edu.sa
1 Department of Electrical Engineering, King Khalid University, Abha, Saudi Arabia
Natl. Acad. Sci. Lett.
https://doi.org/10.1007/s40009-019-00842-w
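
As an illustration of two steps named on this page, the following is a minimal sketch and not the authors' implementation. It assumes the common first-order d.c. blocking form y[n] = x[n] - x[n-1] + r*y[n-1]; the pole value r = 0.995 and the function names dc_block and laplace_mle are hypothetical, since this part of the text gives no filter coefficients. For the Laplace distribution, the standard maximum-likelihood estimates are the sample median (location) and the mean absolute deviation about the median (scale).

```python
import statistics


def dc_block(x, r=0.995):
    """First-order d.c. blocking IIR filter (illustrative sketch).

    Difference equation: y[n] = x[n] - x[n-1] + r * y[n-1].
    r is an assumed pole value (0 < r < 1), not taken from the paper.
    """
    y = []
    prev_x = 0.0  # x[n-1], zero initial condition
    prev_y = 0.0  # y[n-1], zero initial condition
    for sample in x:
        out = sample - prev_x + r * prev_y
        y.append(out)
        prev_x = sample
        prev_y = out
    return y


def laplace_mle(samples):
    """Maximum-likelihood estimates (location, scale) of a Laplace distribution.

    location = sample median
    scale    = mean absolute deviation about the median
    """
    mu = statistics.median(samples)
    b = sum(abs(s - mu) for s in samples) / len(samples)
    return mu, b
```

A constant input decays geometrically at the filter output (y[n] = r^n times the offset), so a smaller r removes the offset faster at the cost of attenuating more low-frequency content. The estimator applies to any real-valued sample set; which STFT quantity it should be applied to is left to the paper's later sections.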