CATALOG-BASED SINGLE-CHANNEL SPEECH-MUSIC SEPARATION FOR
AUTOMATIC SPEECH RECOGNITION
Cemil Demir (1,3), A. Taylan Cemgil (2), Murat Saraçlar (3)

(1) TÜBİTAK-BİLGEM, Kocaeli, Turkey
(2) Computer Engineering Department, Boğaziçi University, Istanbul, Turkey
(3) Electrical and Electronics Engineering Department, Boğaziçi University, Istanbul, Turkey
cdemir@tubitak.uekae.gov.tr, (taylan.cemgil|murat.saraclar)@boun.edu.tr
ABSTRACT
In this study, we analyze the effect of the catalog-based
single-channel speech-music separation method, which we
proposed previously, on speech recognition performance. In
the proposed method, assuming that we know a catalog of
the background music, we developed a generative model for
the superposed speech and music spectrograms. We repre-
sent the speech spectrogram by a Non-negative Matrix Fac-
torization (NMF) model and the music spectrogram by a con-
ditional Poisson Mixture Model (PMM). In this paper, we
propose to recover the speech signals from the mixed sig-
nal in time-domain by detecting the active catalog frames us-
ing the catalog-based method. We compare the performance
of three different signal reconstruction techniques: expectation-
based, posterior-based, and time-domain reconstruction.
Moreover, we compare the performance of our system with
the performance of the traditional NMF model. Our method
outperforms the NMF method in ASR performance and sep-
aration performance in most experimental conditions.
1. INTRODUCTION
Recently, automatic speech recognition (ASR) applications
have become popular in broadcast news transcription sys-
tems. One major problem is the serious drop in perfor-
mance in the presence of background music, which is common
in radio and television broadcasts [1, 2]. Therefore,
removing the background music is important for developing
robust ASR systems. A real-world ASR solution should con-
tain a front-end system capable of segmenting and separating
music and speech from incoming audio signals. The aim of
this study is to analyze the performance of the catalog-based
speech-music separation method, which we proposed previ-
ously, when it is used as a front-end for an ASR system.
Many researchers have studied single-channel source separa-
tion for mixtures of speech from two speakers [3], but there
are only a few studies on single-channel speech-music separa-
tion [4, 5, 6]. Model-based approaches are used to separate
sound mixtures that contain the same class of sources such
as speech from different people [7, 8] or music from differ-
ent instruments [9, 10].
In a previous study [11], we introduced a simple prob-
abilistic model-based approach to separate speech from
music. Unlike other probabilistic approaches, we do not model
the speech in great detail, but instead focus on a model for
the music. The motivation behind our approach is that, es-
pecially in broadcast news, most of the time, the background
music is composed of a repetitive piece of music, called
a 'jingle'. Therefore, we can learn a catalog of these
jingles and thereby hope to improve separation performance.
In our model, the catalog corresponds to a conditional
mixture model. Each spectrogram frame of the music is gen-
erated by a single mixture component, i.e., a catalog element.
The speech spectrogram is generated from an NMF model.
The observed spectrogram is the sum of the speech and mu-
sic. Separation is achieved by joint estimation of the un-
known parameters and hidden variables of this hierarchical
model.
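As a toy illustration (not the authors' implementation), the generative structure described above can be sketched in NumPy: each music frame selects a single catalog element (a conditional mixture), the speech spectrogram follows an NMF factorization, and the observed spectrogram is Poisson with the summed intensities, so the speech and music contributions add exactly. All sizes, distributions, and variable names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

F, T = 64, 20    # frequency bins, time frames (toy sizes)
K = 8            # NMF rank for the speech model
M = 50           # number of catalog frames (the known jingle)

# Hypothetical catalog: magnitude spectra of the known jingle.
catalog = rng.gamma(2.0, 1.0, size=(F, M))

# Speech intensity: NMF model, B (spectral templates) times G (gains).
B = rng.gamma(2.0, 1.0, size=(F, K))
G = rng.gamma(2.0, 1.0, size=(K, T))
speech_intensity = B @ G

# Music intensity: each frame t activates one catalog element r_t
# (conditional mixture), scaled by a volume-adjustment gain v_t.
r = rng.integers(0, M, size=T)      # active catalog frame indices
v = rng.gamma(2.0, 1.0, size=T)     # volume gains
music_intensity = catalog[:, r] * v

# Observed magnitude spectrogram: Poisson with the summed intensity,
# exploiting the superposition property of the Poisson distribution.
X = rng.poisson(speech_intensity + music_intensity)
print(X.shape)
```

Separation then amounts to inferring the hidden quantities (B, G, r, v) from X alone, given the catalog.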
Although we do not have any prior information about the
speech part of the mixture, we assume that the magnitude
spectrogram of the speech signal is generated by a Non-
negative Matrix Factorization (NMF) model.
This way, by finding the parameters of the NMF model, we
can recover the speech signal from the mixture. We use the
probabilistic interpretation of the NMF to develop the sepa-
ration algorithm [12]. The reason for using the probabilis-
tic approach is that we can easily extend the model so that
it contains prior information about the sources. For the
time being, we assume that the music is created by playing
a random part of a known clip and applying frequency and
volume adjustment filters to change the character of the mu-
sic. Our aim is to find out which part of the clip is played
when, while estimating the parameters of the adjustment
filters. This corresponds to a Poisson Mixture Model
(PMM) for the magnitude spectrogram of the music signal.
The overall model consists of the combination of the NMF
model for the speech part and the mixture model for the mu-
sic part of the audio signal. In this study, we develop the
inference method for this overall probabilistic model and ap-
ply this separation method to improve ASR performance.
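The probabilistic interpretation of NMF mentioned above corresponds to maximum-likelihood estimation under a Poisson observation model, which is equivalent to minimizing the generalized Kullback-Leibler divergence between the spectrogram and its factorization. The standard multiplicative updates for this objective can be sketched as follows; this is a generic textbook sketch of KL-NMF, not the paper's full joint inference over the combined NMF/PMM model.

```python
import numpy as np

def kl_nmf(X, K, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative-update NMF minimizing the generalized KL
    divergence, i.e. the ML estimate under X_ft ~ Poisson([B G]_ft)."""
    rng = np.random.default_rng(seed)
    F, T = X.shape
    B = rng.random((F, K)) + eps   # spectral templates
    G = rng.random((K, T)) + eps   # excitations / gains
    for _ in range(n_iter):
        V = B @ G + eps
        B *= (X / V) @ G.T / G.sum(axis=1)
        V = B @ G + eps
        G *= B.T @ (X / V) / B.sum(axis=0)[:, None]
    return B, G
```

The updates keep B and G non-negative by construction, which is why multiplicative rules are the usual choice for this likelihood.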
Unlike the previous study [11], we use the catalog-
based method as a front-end for the ASR task and mea-
sure the separation performance with an ASR evaluation crite-
rion, the Word Error Rate (WER). Moreover, the speech-music sepa-
ration performance of the proposed method is compared with
the separation performance of the traditional NMF-based
method. Furthermore, the effect of the reconstruction tech-
niques, namely the expectation-based and posterior-based techniques,
is examined, and the superiority of the posterior-based approach
is observed experimentally. A time-domain reconstruction
technique, which can be used when the original ver-
sion of the jingle is accessible in the separation phase, is also
proposed and evaluated in this study.
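Of the three reconstruction techniques, the expectation-based one admits a compact sketch: under a Poisson superposition model, the conditional expectation of the speech component given the observed mixture reduces to a ratio ("Wiener-like") mask applied to the observed spectrogram. The function below is a hypothetical illustration assuming speech and music intensity estimates are already available from inference; it is not the paper's exact procedure, and the posterior-based and time-domain variants differ in which quantities the mask (or subtraction) is computed from.

```python
import numpy as np

def expectation_reconstruct(X, speech_intensity, music_intensity, eps=1e-9):
    """Expectation-based reconstruction: E[speech | mixture] for Poisson
    components is the observation scaled by the intensity ratio."""
    mask = speech_intensity / (speech_intensity + music_intensity + eps)
    return mask * X
```

The recovered magnitude spectrogram is then combined with the mixture phase and inverted to the time domain in the usual way.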
This paper is organized as follows: in Section 2, we
overview the catalog-based and NMF based speech-music
separation methods. In Section 3, we briefly explain 3 dif-
ferent speech reconstruction techniques using source sepa-
19th European Signal Processing Conference (EUSIPCO 2011), Barcelona, Spain, August 29 - September 2, 2011
© EURASIP, 2011 - ISSN 2076-1465