I.J. Image, Graphics and Signal Processing, 2013, 11, 13-22
Published Online September 2013 in MECS (http://www.mecs-press.org/)
DOI: 10.5815/ijigsp.2013.11.02
Copyright © 2013 MECS
Spectral Subtractive-Type Algorithms for
Enhancement of Noisy Speech: An Integrative
Review
Navneet Upadhyay¹, Abhijit Karmakar²
¹Department of Electrical & Electronics Engineering, Birla Institute of Technology and Science, Pilani 333031, India
²Integrated Circuit Design Group, CSIR - Central Electronics Engineering Research Institute, Pilani 333031, India
e-mail: navneet_upd@rediffmail.com¹, abhijit@ceeri.ernet.in²
Abstract — The spectral subtraction method is a classical approach for the enhancement of speech degraded by additive background noise. The basic principle of this method is to estimate the short-time spectral magnitude of speech by subtracting the estimated noise spectrum from the noisy speech spectrum. Equivalently, this can be achieved by multiplying the noisy speech spectrum by a gain function and later combining the result with the phase of the noisy speech. Besides reducing the background noise, this method introduces an annoying perceptible tonal characteristic into the enhanced speech, known as remnant musical noise, which degrades human listening. Several variations and implementations of this method have been proposed over the past decades to address the limitations of the basic spectral subtraction method. These variations constitute a family of subtractive-type algorithms and operate in the frequency domain. The objective of this paper is to provide an extensive overview of spectral subtractive-type algorithms for the enhancement of noisy speech. After the review, the paper concludes by outlining future directions of speech enhancement research from the spectral subtraction perspective.
Index Terms — Speech enhancement, additive
background noise, noise estimation, spectral subtractive-
type algorithms, remnant musical noise
I. INTRODUCTION
Speech is one of the most prominent and primary modes of human-to-human and human-to-machine communication in various fields, for instance, automatic speech recognition and speaker identification [1]. Present-day speech communication systems are severely degraded by various types of unwanted random sounds, which make the listening task difficult for a direct listener and cause inaccurate transfer of information [2]. Therefore, speech enhancement has been one of the main motives of research endeavors in the field of speech processing over the past few decades. The main objective of speech enhancement is to minimize the degree of distortion of the desired speech signal and to improve one or more perceptual aspects of speech, such as its quality and/or intelligibility.
The quality of speech is a subjective measure which
reflects the way that the signal is perceived by listeners.
Intelligibility, on the other hand, is an objective measure of the amount of information that listeners can extract from the speech signal. These two measures are independent of each other: a speech signal may be of high quality yet low intelligibility, and vice versa [1-4].
Speech enhancement methods are classified, according to the number of microphones used for collecting the speech data, into single-, dual-, or multi-channel methods. Although multi-channel speech enhancement performs better than single-channel speech enhancement [1-2], single-channel speech enhancement remains a significant field of research because of its simple implementation and ease of computation. Single-channel speech enhancement uses only one microphone to collect the noisy speech data [1-4].
Estimating the spectral amplitude from the noisy data is easier than estimating both the amplitude and the phase. It was revealed in [5-6] that the short-time spectral amplitude (STSA) is more important than the phase information for the quality and intelligibility of speech. Therefore, single-channel speech enhancement methods are usually divided into two classes based on STSA estimation. The first class applies subtractive-type algorithms, which attempt to estimate the short-time spectral magnitude (STSM) of speech by subtracting the estimated noise spectrum; here, the noise is estimated during speech pauses [7-11, 13]. The other class applies a spectral subtraction filter (SSF) to the noisy speech to obtain the spectral amplitude of the enhanced speech. The design principle is to select appropriate filter parameters so as to minimize the difference between the enhanced speech and the clean speech signal [8].
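The subtractive-type class described above can be sketched in a few lines. The following is a minimal, hypothetical single-frame illustration in pure Python; the naive DFT helpers, the frame length, and the half-wave rectification to zero are assumptions of this sketch, not taken from any specific algorithm in the literature:

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (for illustration only)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Inverse DFT, returning the real part of the time-domain signal."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)).real / N
            for n in range(N)]

def spectral_subtraction(noisy_frame, noise_mag_est):
    """Basic magnitude spectral subtraction for one analysis frame.

    noise_mag_est is the noise magnitude spectrum, assumed to have been
    estimated during speech pauses.
    """
    Y = dft(noisy_frame)
    # Subtract the estimated noise magnitude; half-wave rectify negative
    # results to zero (this rectification is one source of musical noise).
    S_mag = [max(abs(Yk) - Dk, 0.0) for Yk, Dk in zip(Y, noise_mag_est)]
    # Recombine the modified magnitude with the noisy phase, then invert.
    S = [m * cmath.exp(1j * cmath.phase(Yk)) for m, Yk in zip(S_mag, Y)]
    return idft(S)
```

Equivalently, the subtraction can be expressed as a gain function H(k) = max(1 - D(k)/|Y(k)|, 0) multiplied with the noisy spectrum, which is the filtering view taken by the SSF class.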
In real-world listening environments, speech is mostly degraded by additive noise [5, 9-14]. Additive noise is typically background noise that is uncorrelated with the clean speech signal, such as white Gaussian noise (WGN), colored noise, or multi-talker (babble) noise. The background noise may be stationary or non-stationary in nature. Therefore, the