CATALOG-BASED SINGLE-CHANNEL SPEECH-MUSIC SEPARATION FOR
AUTOMATIC SPEECH RECOGNITION
Cemil Demir (1,3), A. Taylan Cemgil (2), Murat Saraçlar (3)

(1) TÜBİTAK-BİLGEM, Kocaeli, Turkey
(2) Computer Engineering Department, Boğaziçi University, Istanbul, Turkey
(3) Electrical and Electronics Engineering Department, Boğaziçi University, Istanbul, Turkey
cdemir@tubitak.uekae.gov.tr, (taylan.cemgil|murat.saraclar)@boun.edu.tr
ABSTRACT
In this study, we analyze the effect of the catalog-based
single-channel speech-music separation method, which we
proposed previously, on speech recognition performance. In
the proposed method, assuming that we know a catalog of
the background music, we developed a generative model for
the superposed speech and music spectrograms. We repre-
sent the speech spectrogram by a Non-negative Matrix Fac-
torization (NMF) model and the music spectrogram by a con-
ditional Poisson Mixture Model (PMM). In this paper, we
propose to recover the speech signals from the mixed sig-
nal in time-domain by detecting the active catalog frames us-
ing the catalog-based method. We compare the performance
of three different signal reconstruction techniques: expectation-
based, posterior-based, and time-domain reconstruction.
Moreover, we compare the performance of our system with
the performance of the traditional NMF model. Our method
outperforms the NMF method in ASR performance and sep-
aration performance in most experimental conditions.
1. INTRODUCTION
Recently, automatic speech recognition (ASR) applications
have become popular in broadcast news transcription sys-
tems. One major problem is the serious drop in perfor-
mance in the presence of background music, which is common
in radio and television broadcasts [1, 2]. Therefore,
removing the background music is important for developing
robust ASR systems. A real-world ASR solution should con-
tain a front-end system capable of segmenting and separating
music and speech from incoming audio signals. The aim of
this study is to analyze the performance of the catalog-based
speech-music separation method, which we proposed previ-
ously, when it is used as a front-end for an ASR system.
Many researchers have studied single-channel source separa-
tion for mixtures of speech from two speakers [3], but there
are only a few studies on single-channel speech-music separa-
tion [4, 5, 6]. Model-based approaches are used to separate
sound mixtures that contain the same class of sources such
as speech from different people [7, 8] or music from differ-
ent instruments [9, 10].
In a previous study [11], we introduced a simple prob-
abilistic model-based approach to separate speech from
music. Unlike other probabilistic approaches, we do not model
the speech in great detail, but instead focus on a model for
the music. The motivation behind our approach is that, es-
pecially in broadcast news, most of the time, the background
music is composed of a repetitive piece of music, called
a 'jingle'. Therefore, we can learn a catalog of these
jingles and thereby hope to improve separation performance.
In our model, the catalog corresponds to a conditional
mixture model. Each spectrogram frame of the music is gen-
erated by a single mixture component, i.e., a catalog element.
The speech spectrogram is generated from an NMF model.
The observed spectrogram is the sum of the speech and mu-
sic. Separation is achieved by joint estimation of the un-
known parameters and hidden variables of this hierarchical
model.
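As a toy illustration (not the authors' implementation), the generative structure described above can be sketched in NumPy: each music frame selects a single catalog element (a conditional mixture), the speech spectrogram follows an NMF factorization, and the observed spectrogram is Poisson with the summed intensities, so the speech and music contributions add exactly. All sizes, distributions, and variable names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

F, T = 64, 20    # frequency bins, time frames (toy sizes)
K = 8            # NMF rank for the speech model
M = 50           # number of catalog frames (the known jingle)

# Hypothetical catalog: magnitude spectra of the known jingle.
catalog = rng.gamma(2.0, 1.0, size=(F, M))

# Speech intensity: NMF model, B (spectral templates) times G (gains).
B = rng.gamma(2.0, 1.0, size=(F, K))
G = rng.gamma(2.0, 1.0, size=(K, T))
speech_intensity = B @ G

# Music intensity: each frame t activates one catalog element r_t
# (conditional mixture), scaled by a volume-adjustment gain v_t.
r = rng.integers(0, M, size=T)      # active catalog frame indices
v = rng.gamma(2.0, 1.0, size=T)     # volume gains
music_intensity = catalog[:, r] * v

# Observed magnitude spectrogram: Poisson with the summed intensity,
# exploiting the superposition property of the Poisson distribution.
X = rng.poisson(speech_intensity + music_intensity)
print(X.shape)
```

Separation then amounts to inferring the hidden quantities (B, G, r, v) from X alone, given the catalog.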
Although we do not have any prior information about the
speech part of the mixture, we assume that the magnitude
spectrogram of the speech signal is generated by a Non-
negative Matrix Factorization (NMF) model.
This way, by finding the parameters of the NMF model, we
can recover the speech signal from the mixture. We use the
probabilistic interpretation of the NMF to develop the sepa-
ration algorithm [12]. The reason for using the probabilis-
tic approach is that we can easily extend the model so that
it contains prior information about the sources. For the
time being, we assume that the music is created by playing
a random part of a known clip and applying frequency and
volume adjustment filters to change the character of the mu-
sic. Our aim is to find out which part of the clip is played
when, while estimating the parameters of the adjustment
filters. This corresponds to a Poisson Mixture Model
(PMM) for the magnitude spectrogram of the music signal.
The overall model consists of the combination of the NMF
model for the speech part and the mixture model for the mu-
sic part of the audio signal. In this study, we develop the
inference method for this overall probabilistic model and ap-
ply this separation method to improve ASR performance.
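The probabilistic interpretation of NMF mentioned above corresponds to maximum-likelihood estimation under a Poisson observation model, which is equivalent to minimizing the generalized Kullback-Leibler divergence between the spectrogram and its factorization. The standard multiplicative updates for this objective can be sketched as follows; this is a generic textbook sketch of KL-NMF, not the paper's full joint inference over the combined NMF/PMM model.

```python
import numpy as np

def kl_nmf(X, K, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative-update NMF minimizing the generalized KL
    divergence, i.e. the ML estimate under X_ft ~ Poisson([B G]_ft)."""
    rng = np.random.default_rng(seed)
    F, T = X.shape
    B = rng.random((F, K)) + eps   # spectral templates
    G = rng.random((K, T)) + eps   # excitations / gains
    for _ in range(n_iter):
        V = B @ G + eps
        B *= (X / V) @ G.T / G.sum(axis=1)
        V = B @ G + eps
        G *= B.T @ (X / V) / B.sum(axis=0)[:, None]
    return B, G
```

The updates keep B and G non-negative by construction, which is why multiplicative rules are the usual choice for this likelihood.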
Unlike the previous study [11], we use the catalog-
based method as a front-end for the ASR task and mea-
sure the separation performance with an ASR evaluation crite-
rion, the Word Error Rate (WER). Moreover, the speech-music sepa-
ration performance of the proposed method is compared with
the separation performance of the traditional NMF-based
method. Furthermore, the effect of the reconstruction tech-
niques, namely the expectation-based and posterior-based techniques,
is examined, and the superiority of the posterior-based approach
is observed experimentally. A time-domain reconstruction
technique, which can be used when the original ver-
sion of the jingle is accessible in the separation phase, is also
proposed and evaluated in this study.
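Of the three reconstruction techniques, the expectation-based one admits a compact sketch: under a Poisson superposition model, the conditional expectation of the speech component given the observed mixture reduces to a ratio ("Wiener-like") mask applied to the observed spectrogram. The function below is a hypothetical illustration assuming speech and music intensity estimates are already available from inference; it is not the paper's exact procedure, and the posterior-based and time-domain variants differ in which quantities the mask (or subtraction) is computed from.

```python
import numpy as np

def expectation_reconstruct(X, speech_intensity, music_intensity, eps=1e-9):
    """Expectation-based reconstruction: E[speech | mixture] for Poisson
    components is the observation scaled by the intensity ratio."""
    mask = speech_intensity / (speech_intensity + music_intensity + eps)
    return mask * X
```

The recovered magnitude spectrogram is then combined with the mixture phase and inverted to the time domain in the usual way.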
This paper is organized as follows: in Section 2, we
overview the catalog-based and NMF based speech-music
separation methods. In Section 3, we briefly explain 3 dif-
ferent speech reconstruction techniques using source sepa-
19th European Signal Processing Conference (EUSIPCO 2011), Barcelona, Spain, August 29 - September 2, 2011
© EURASIP, 2011 - ISSN 2076-1465