SINGLE-CHANNEL SOURCE SEPARATION OF SPEECH AND MUSIC USING SHORT-TIME SPECTRAL KURTOSIS

Yevgeni Litvin (1), Israel Cohen (1), and Jacob Benesty (2)

(1) Department of Electrical Engineering, Technion - Israel Institute of Technology, Technion City, Haifa 32000, Israel
(2) INRS-EMT, Universite du Quebec, Montreal, QC H5A 1K6, Canada

{elitvin@tx, icohen@ee}.technion.ac.il, benesty@emt.inrs.ca

ABSTRACT

In this paper, the problem of blind monaural speech/music source separation is addressed using short-time spectral kurtosis (STSK). An estimator for STSK is introduced, and a source separation algorithm is formulated that relies on the spectral kurtosis differences of distinct signal classes. The performance of the proposed algorithm is evaluated on mixtures of speech signals and various types of music signals. The results are compared to those obtained by a competing monaural source separation algorithm, which is based on a Gaussian mixture model (GMM).

1. INTRODUCTION

High order statistics are frequently used in the task of multichannel source separation. In particular, kurtosis is used as a measure of non-Gaussianity of the recovered mixture components. Spectral kurtosis (SK) is a tool capable of locating non-Gaussian components including their location in the frequency domain. SK was first introduced by Dwyer [1]. He defined it as a kurtosis value of the real part of the STFT filterbank output. Antoni [2] introduced a different formalization of the SK by means of Wold-Cramér decomposition which gave a theoretical ground for the estimation of the SK of non-stationary processes. He also showed practical applications of his approach in the field of machine surveillance and diagnostics [3, 4]. Other applications of spectral kurtosis include SNR estimation in speech signals [5], denoising [6], and subterranean termite detection [7].

In this paper we show how the SK of a mixture relates to the SK of its components.
We define the short-time spectral kurtosis (STSK) as a time-localized version of the SK. We define a simple STSK estimator and show its application to the monaural separation of speech and music. The proposed algorithm uses STSK analysis to assign time-frequency bins of a mixture to the correct source. A binary mask is used to reject the interfering source in the STFT domain. In the experimental results we study the separation performance of the proposed algorithm on mixtures of speech and musical excerpts played by various instruments. We show improved performance of the proposed algorithm compared to a competing GMM-based algorithm [8].

The remainder of this paper is structured as follows. In Section 2 we present the concept of SK. Section 3 extends the idea of SK to non-stationary signals. In Section 4 we describe a simple source separation algorithm based on the SK analysis. An experimental study is given in Section 5, followed by a short discussion in Section 6.

This work was supported by the Israel Science Foundation under Grant 1085/05 and by the European Commission under project Memories FP6-IST-035300.

2. SPECTRAL KURTOSIS

In this section, we present the SK. We analytically evaluate the SK for some common probability distributions, and show how the SK of an instantaneous mixture relates to the SK of its components.

Let $x(n)$ be a real, discrete-time, stationary random vector. Let $X_k$ be its $N$-point discrete Fourier transform (DFT), defined by

$$X_k = \sum_{n=0}^{N-1} x(n)\, e^{-j \frac{2\pi}{N} k n}, \tag{1}$$

where $k$ is the frequency index. Due to the circularity of $X_k$, and following the reasoning in [9], the only way to define a spectral kurtosis for $x(n)$ that does not vanish is

$$K_x(k) = \frac{\kappa_4\{X_k, X_k^*, X_k, X_k^*\}}{\left(\kappa_2\{X_k, X_k^*\}\right)^2}, \tag{2}$$

with $\kappa_r$ being an $r$-th order cumulant. Using the circularity, the definition can be simplified to

$$K_x(k) = \frac{E\{|X_k|^4\}}{\left(E\{|X_k|^2\}\right)^2} - 2. \tag{3}$$

Let $x_{WG}(n)$ be a white Gaussian signal.
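As a numerical illustration of definition (3), the following is a minimal sketch and not the estimator proposed in this paper: it replaces the expectations in (3) with sample means over independent realizations. The function name `spectral_kurtosis` and the values of `N`, `R`, `a`, and `m0` are assumptions chosen for illustration. It is applied to white Gaussian noise and to a random-phase complex exponential of the kind considered later in this section. For the real-valued noise, bins $k = 0$ and $k = N/2$ are skipped because the DFT of a real signal is purely real there, so the circularity assumption behind (3) does not hold at those two bins.

```python
import numpy as np

def spectral_kurtosis(frames):
    """Sample version of eq. (3), evaluated for every DFT bin.

    frames: array of shape (R, N), one independent realization per row.
    Returns an array of length N with the estimated SK per bin.
    """
    X = np.fft.fft(frames, axis=1)            # N-point DFT of each realization, eq. (1)
    p2 = np.mean(np.abs(X) ** 2, axis=0)      # sample estimate of E{|X_k|^2}
    p4 = np.mean(np.abs(X) ** 4, axis=0)      # sample estimate of E{|X_k|^4}
    return p4 / p2 ** 2 - 2.0

rng = np.random.default_rng(0)
N, R = 64, 20000                              # illustrative sizes (assumed)

# White Gaussian noise: the SK estimate should be close to zero.
x_wg = rng.standard_normal((R, N))
sk_wg = spectral_kurtosis(x_wg)
# Exclude k = 0 and k = N/2, where the DFT of a real signal is not circular.
interior = np.r_[1:N // 2, N // 2 + 1:N]
print("max |SK| of white Gaussian noise:", np.max(np.abs(sk_wg[interior])))

# Random-phase complex exponential located at bin m0 (assumed values).
a, m0 = 1.5, 5
phi = rng.uniform(0.0, 2.0 * np.pi, size=(R, 1))
n = np.arange(N)
x_sine = a * np.exp(1j * (2.0 * np.pi * m0 * n / N + phi))
# Only bin m0 carries energy, so the estimate is read at that bin alone.
sk_sine_m0 = spectral_kurtosis(x_sine)[m0]
print("SK of the complex exponential at m0:", sk_sine_m0)
```

Averaging over independent realizations is the simplest stand-in for the expectation operator; with a single recording, averaging over time frames would be needed instead, which is where stationarity assumptions enter.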
The DFT of $x_{WG}(n)$ is a complex normally distributed vector. All cumulants of an order greater than 3 are zero for Gaussian and complex Gaussian random variables. By eq. (2), the SK of $x_{WG}(n)$ is zero for all $k$.

Let $a$ be an amplitude and $m_0$ a frequency index. Let $x_{sine}(n) = a\, e^{j\left(2\pi \frac{m_0}{N} n + \varphi\right)}$. If $\varphi \sim U(0, 2\pi)$, $x_{sine}(n)$ is a stationary process. We note that, at $k = m_0$, $E\{|X_k|^4\} = \left(E\{|X_k|^2\}\right)^2 = (Na)^4$. It follows from eq. (3) that $K_{x_{sine}}(m_0) = -1$.

In this work we use the instantaneous mixture model:

$$x(n) = s_1(n) + s_2(n). \tag{4}$$

Assume that $s_1(n)$ and $s_2(n)$ are statistically independent stationary processes. Let $\phi_s(k) \triangleq E\left(|S_k|^2\right)$ and $\gamma(k) \triangleq \phi_{s_1}(k) / \phi_{s_2}(k)$