2007 IEEE International Conference on Signal Processing and Communications (ICSPC 2007), 24-27 November 2007, Dubai, United Arab Emirates

A Fixed Dimension Modified Sinusoid Model (FD-MSM) for Single Microphone Sound Separation

Pejman Mowlaee Begzade Mahale¹, Abolghasem Sayadiyan², Alireza Bayesteh Tashk³
¹PhD student, ²Associate Professor, ³MSc student, Department of Electrical Engineering, Amirkabir University of Technology, 15875-4413, Hafez, Tehran, Iran

Abstract- The lack of a flexible analysis model has been identified as an important issue in applications such as source separation. In this paper, a Fixed Dimension Modified Sinusoid Model (FD-MSM) is proposed for the analysis of all audible signals, comprising speech, music, and their mixtures. Peak picking in the Mel domain yields a fixed number of parameters in the proposed FD-MSM, which is desirable for clustering algorithms such as VQ (Vector Quantization) or GMM (Gaussian Mixture Model), commonly used in source separation scenarios. Applying the proposed FD-MSM to various audible signals, we observe that the resulting signal is perceptually indistinguishable from the original.

Index Terms- Mel-scale, sinusoidal model, phase coherency.

1. INTRODUCTION

The lack of a flexible analysis model has been identified as a challenging problem in many applications, especially speech enhancement and source separation.
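The fixed-dimension property claimed in the abstract, one spectral peak per Mel-spaced band so that every frame yields the same number of parameters, can be sketched as follows. This is a minimal illustration under assumed settings (8 kHz sampling, 20 Mel bands, 512-point FFT), not the authors' implementation:

```python
import numpy as np

def hz_to_mel(f):
    # Standard Mel-scale mapping (O'Shaughnessy formula).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def fixed_dim_mel_peaks(frame, fs=8000, n_bands=20, n_fft=512):
    """Pick the largest spectral peak in each Mel-spaced band, so every
    frame yields exactly n_bands (amplitude, frequency) parameter pairs."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    # Mel-spaced band edges covering 0 .. fs/2.
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_bands + 1))
    amps, peak_freqs = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = np.where((freqs >= lo) & (freqs < hi))[0]
        if len(idx) == 0:                  # empty band: pad with zeros
            amps.append(0.0)
            peak_freqs.append(0.5 * (lo + hi))
            continue
        k = idx[np.argmax(spec[idx])]      # strongest bin in this band
        amps.append(spec[k])
        peak_freqs.append(freqs[k])
    return np.array(amps), np.array(peak_freqs)
```

Because every frame produces the same number of amplitude-frequency pairs regardless of signal content, the resulting feature vectors can be fed directly to VQ or GMM clustering, which is the motivation the paper gives for fixing the dimension.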
Among the analysis models commonly used for speech signals, two approaches have generally been adopted: (1) the classical pitch-excited linear predictive coding (LPC) model [1], which represents the speech production process as a spectrally flat excitation signal driving a slowly varying vocal tract filter. Pitch-excited LPC, however, is sensitive to errors in pitch and voicing-state estimation and consequently does not work well for certain speakers. (2) Harmonic modeling, a special case of the sinusoidal model of speech first introduced by Quatieri and McAulay [2]. Other sinewave models have been discussed in the literature [3,4,10]. Whichever approach is used for the analysis of a speech signal, the spectral envelope is a key feature, generally obtained either by the method proposed by Paul [5] or by the one developed in [2]. In the latter, all peaks of the Short-Time Fourier Transform (STFT) spectrum are first marked; the peaks occurring close to the pitch value and its harmonics are then retained, while the remaining peaks are discarded. Note that such a scheme requires a pitch estimate for each speech segment, which increases the computational complexity; moreover, it cannot analyze audible signals other than speech. Another model, Analysis-by-Synthesis/Overlap-Add (ABS/OLA), introduced by George and Smith, has proven successful in speech analysis and modification [6]. However, its computational load remains an obstacle to real-time analysis because of its exhaustive frequency search, and it yields poor-quality synthesized speech when the analysis window is too short. A more recently proposed long-term (LT) model provides a synthesis quality similar to that obtained with short-term interpolation of the measured phases, but unvoiced sections are not considered, since they cannot be modeled efficiently by the sinusoidal/harmonic model [7].
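The harmonic peak-selection step attributed to [2] above (mark all STFT peaks, keep only those near the pitch and its harmonics) can be sketched as follows; the 20% of f0 tolerance and the other settings are illustrative assumptions, not values from the cited work:

```python
import numpy as np

def harmonic_peak_select(frame, f0, fs=8000, n_fft=1024, tol=0.2):
    """Mark all local maxima of the STFT magnitude, then keep only those
    lying within tol*f0 of a harmonic of the pitch f0; the rest are
    discarded (hypothetical illustration of the selection step)."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    # Local maxima: bins larger than both neighbours.
    peaks = [k for k in range(1, len(spec) - 1)
             if spec[k] > spec[k - 1] and spec[k] > spec[k + 1]]
    kept = []
    for k in peaks:
        h = round(freqs[k] / f0)           # nearest harmonic number
        if h >= 1 and abs(freqs[k] - h * f0) < tol * f0:
            kept.append((freqs[k], spec[k]))
    return kept
```

The sketch also makes the text's complexity objection concrete: the selection cannot run at all without a per-frame pitch estimate f0, and a pitch value is undefined for non-speech material such as polyphonic music.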
None of the above approaches is suitable for the source separation application, for the following reasons: (1) the number of features extracted by the model is not fixed, which may significantly degrade Vector Quantization (VQ) or Gaussian Mixture Model (GMM) clustering performance; (2) they cannot handle the broad range of audible signals that sound separation requires. In this paper, a new sinusoidal analysis model called the Fixed Dimension Modified Sinusoid Model (FD-MSM) is therefore introduced, covering inputs other than speech, i.e. songs, music, and their mixtures, to demonstrate the lower computational complexity of the proposed model while preserving the quality of the original signal as closely as possible. Ideas for reaching this goal are presented in the following sections.

The paper is organized as follows: Section 2 presents a brief review of previous sinusoidal models for speech signals. In Section 3, several advantages of the proposed FD-MSM model are stated. In Section 4, simulation results are presented. Section 5 concludes.

2. SINUSOIDAL MODELS FOR SPEECH SIGNALS

The process of speech production resembles filtering in the signal processing sense: an excitation signal e(n), produced by the vocal cords, is filtered by the vocal tract h(n). The excitation signal is nearly an impulse train during voiced speech and noise during unvoiced speech [1]. This filtering process can therefore be formulated in the frequency domain as

X(jω) = [E(jω) · H(jω)] * W(jω)    (1)

where X(jω), E(jω), H(jω), and W(jω) are the Fourier transforms of the reconstructed speech signal, excitation,

1-4244-1236-6/07/$25.00 © 2007 IEEE
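The time-domain counterpart of Eq. (1), an impulse-train excitation convolved with a vocal-tract impulse response and then multiplied by an analysis window (multiplication in time corresponds to the spectral convolution with W(jω)), can be sketched as follows; the single decaying resonator at roughly 500 Hz standing in for h(n) is an arbitrary assumption for illustration:

```python
import numpy as np

def voiced_frame(f0=100.0, fs=8000, n=256):
    """Build one windowed frame of voiced speech per the source-filter
    picture of Eq. (1): x(n) = w(n) * (e(n) conv h(n)), where e(n) is a
    glottal impulse train and h(n) a toy vocal-tract impulse response."""
    period = int(round(fs / f0))
    e = np.zeros(n)
    e[::period] = 1.0                      # impulse-train excitation e(n)
    # Toy single-formant resonator near 500 Hz, truncated to 64 taps;
    # a stand-in for a real vocal-tract filter h(n).
    t = np.arange(64) / fs
    h = np.exp(-300.0 * t) * np.cos(2 * np.pi * 500.0 * t)
    x = np.convolve(e, h)[:n]              # e(n) * h(n) in time
    return x * np.hanning(n)               # apply analysis window w(n)
```

The magnitude spectrum of the returned frame shows the harmonic comb of E(jω) at multiples of f0, shaped by the resonance of H(jω) and smeared by the window spectrum W(jω), exactly the structure that Eq. (1) predicts.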