2007 IEEE International Conference on Signal Processing and Communications (ICSPC 2007), 24-27 November 2007, Dubai, United Arab Emirates
A Fixed Dimension Modified Sinusoid Model (FD-MSM) for
Single Microphone Sound Separation
Pejman Mowlaee Begzade Mahale¹, Abolghasem Sayadiyan², Alireza Bayesteh Tashk³
¹PhD Student, ²Associate Professor, ³MSc Student
Department of Electrical Engineering, Amirkabir University of Technology, Hafez Ave., 15875-4413, Tehran, Iran
Abstract- The lack of a flexible analysis model is an important issue in different applications such as source separation. In this paper, a Fixed Dimension Modified Sinusoid Model (FD-MSM) is proposed for the analysis of all audible signals, including speech, music, and their mixtures. Peak picking in the Mel domain yields a fixed number of parameters in the proposed FD-MSM, which is desirable for clustering algorithms such as Vector Quantization (VQ) or the Gaussian Mixture Model (GMM), commonly used in source separation scenarios. Applying the proposed FD-MSM to various audible signals, we observe that the resulting signal is perceptually indistinguishable from the original.
Index Terms - Mel scale, sinusoidal model, phase coherency.
1. INTRODUCTION
The lack of a flexible analysis model is a challenging problem in many applications, especially speech enhancement and source separation. Among the analysis models commonly used for speech signals, two approaches have generally been taken so far: (1) the classical pitch-excited linear predictive coding (LPC) model [1], which represents the speech production process in terms of a spectrally flat excitation signal driving a slowly varying vocal tract filter; however, pitch-excited LPC is sensitive to errors in pitch and voicing-state estimation and as a result does not work well for certain speakers; and (2) harmonic modeling, a special case of the sinusoidal model of speech first introduced by Quatieri and McAulay [2]. Other sinewave models have been discussed in the literature [3,4,10].
Regardless of which approach is used to analyze the speech signal, the spectral envelope is a key feature, generally obtained either by the method proposed by Paul [5] or by the one developed in [2]. In the latter, all peaks of the Short-Time Fourier Transform (STFT) spectrum are first marked; the peaks lying close to the pitch value and its harmonics are then retained while the remaining peaks are discarded. Note that relying on a pitch detection algorithm requires a pitch estimate for every speech segment, which increases the computational complexity. Moreover, this approach cannot analyze audible signals other than speech.
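As an illustration of this harmonic peak-selection step, the following Python sketch marks the local maxima of a windowed FFT magnitude frame and keeps only those near multiples of a given pitch estimate. The window choice, tolerance, and amplitude floor are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def harmonic_peaks(frame, sr, f0, tol=0.15, floor=0.05):
    """Keep spectral peaks lying near the pitch f0 and its harmonics.

    tol   -- allowed deviation from a harmonic, as a fraction of f0
    floor -- discard peaks below this fraction of the spectral maximum
    """
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    kept = []
    for k in range(1, len(mag) - 1):
        # local maximum of the magnitude spectrum
        if mag[k] > mag[k - 1] and mag[k] >= mag[k + 1]:
            if mag[k] < floor * mag.max():
                continue  # ignore window side-lobes
            h = round(freqs[k] / f0)  # nearest harmonic number
            if h >= 1 and abs(freqs[k] - h * f0) <= tol * f0:
                kept.append((freqs[k], mag[k]))
    return kept
```

Peaks far from any harmonic (e.g. an interference tone between harmonics) are discarded, which is why the method presupposes a reliable pitch estimate for every segment.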
As another model, Analysis-by-Synthesis/Overlap-Add (ABS/OLA), introduced by Smith and George, has proven successful in speech analysis and modification [6]. However, its computational load remains an obstacle to real-time analysis because of its exhaustive frequency search, and it yields poor-quality synthesized speech when the analysis window is too short.
Another recently proposed model, the long-term (LT) model, provides synthesis quality similar to that obtained with short-term interpolation of the measured phases, but unvoiced sections are not considered, since they cannot be modeled efficiently by the sinusoidal/harmonic model [7].
None of the above approaches is suitable for the source separation application, for the following reasons: (1) the number of features extracted by the model is not fixed, which may significantly degrade Vector Quantization (VQ) or Gaussian Mixture Model (GMM) clustering performance; and (2) they cannot handle the broad range of audible signals that is critical for sound separation. In this paper, a new sinusoidal analysis model called the Fixed Dimension Modified Sinusoid Model (FD-MSM) is therefore introduced. It accommodates inputs other than speech signals, i.e., songs, music, and mixtures, and achieves lower computational complexity while preserving the quality of the original signal as closely as possible. The ideas for reaching this goal are presented in the following sections.
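To make the fixed-dimension idea concrete, the sketch below divides the spectrum into equally spaced Mel bands and retains one peak per band, so the parameter count is identical for every frame regardless of pitch or signal type. The band count and the Mel formula are common conventions assumed here for illustration, not the paper's exact settings:

```python
import numpy as np

def hz_to_mel(f):
    # standard Mel-scale mapping
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_band_peaks(mag, sr, n_bands=20):
    """Pick the largest spectral peak inside each of n_bands Mel bands,
    yielding a fixed-length parameter vector per frame."""
    freqs = np.linspace(0.0, sr / 2.0, len(mag))
    edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_bands + 1)
    band = np.clip(np.digitize(hz_to_mel(freqs), edges) - 1, 0, n_bands - 1)
    peak_freq = np.zeros(n_bands)
    peak_amp = np.zeros(n_bands)
    for b in range(n_bands):
        idx = np.where(band == b)[0]
        if idx.size == 0:
            continue  # very narrow band containing no FFT bin
        k = idx[np.argmax(mag[idx])]
        peak_freq[b], peak_amp[b] = freqs[k], mag[k]
    return peak_freq, peak_amp
```

Because every frame produces exactly n_bands (frequency, amplitude) pairs, the features stack directly into the fixed-length vectors that VQ or GMM training expects, in contrast to pitch-dependent harmonic models whose peak count varies with f0.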
The paper is organized as follows. Section II presents a brief review of previous sinusoidal models for speech signals. Section III describes several advantages of the proposed FD-MSM model. Section IV presents simulation results, and Section V concludes.
2. SINUSOIDAL MODELS FOR SPEECH SIGNALS
The process of speech production is similar to filtering
in the context of signal processing, where an excitation
signal, e(n), produced by the vocal cords, is filtered by
the vocal tract, h(n). The excitation signal is either nearly
an impulse train during voiced speech or noise during
unvoiced speech [1]. Since the analyzed segment is also multiplied by an analysis window w(n), this filtering process can be formulated in the frequency domain as follows:

X(jω) = [E(jω) × H(jω)] * W(jω)    (1)

where X(jω), E(jω), H(jω), and W(jω) are the Fourier transforms of the reconstructed speech signal, the excitation, the vocal tract filter, and the analysis window, respectively.
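Equation (1) can be illustrated numerically: in the time domain it amounts to filtering an impulse-train excitation through a vocal-tract filter and multiplying by an analysis window. The sketch below uses a two-pole resonator for h(n); the resonator, pitch, and window are illustrative assumptions, not the paper's model:

```python
import numpy as np

sr, n = 8000, 512
f0 = 100.0                      # pitch of the voiced excitation
period = int(sr / f0)

# e(n): impulse train produced by the vocal cords during voiced speech
e = np.zeros(n)
e[::period] = 1.0

# h(n): a crude two-pole resonator standing in for the vocal tract,
# with resonance near fc and pole radius r (illustrative values)
r, fc = 0.95, 500.0
theta = 2.0 * np.pi * fc / sr
a1, a2 = -2.0 * r * np.cos(theta), r * r
x = np.zeros(n)
for i in range(n):
    x[i] = e[i]
    if i >= 1:
        x[i] -= a1 * x[i - 1]
    if i >= 2:
        x[i] -= a2 * x[i - 2]

# multiplying by w(n) in time corresponds to convolving
# with W(jω) in Eq. (1)
X = np.fft.rfft(x * np.hanning(n))
```

The magnitude |X| should show harmonics of f0 shaped by the resonator's envelope, i.e. the [E(jω) × H(jω)] * W(jω) structure of Eq. (1).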
1-4244-1236-6/07/$25.00 © 2007 IEEE