SOURCE EXTRACTION FROM TWO-CHANNEL MIXTURES BY JOINT COSINE PACKET ANALYSIS

Andrew Nesbit, Mike Davies, Mark Plumbley and Mark Sandler

Centre for Digital Music, Department of Electronic Engineering
Queen Mary, University of London
Mile End Road, London, E1 4NS, United Kingdom
email: {Andrew.Nesbit,Mike.Davies,Mark.Plumbley,Mark.Sandler}@elec.qmul.ac.uk

ABSTRACT

This paper describes novel, computationally efficient approaches to source separation of underdetermined instantaneous two-channel mixtures. A best basis algorithm is applied to trees of local cosine bases to determine a sparse transform. We assume that the mixing parameters are known and focus on demixing sources by binary time-frequency masking.

We describe a method for deriving a best local cosine basis from the mixtures by minimising an $\ell^1$ norm cost function. This basis is adapted to the input of the masking process. Then, we investigate how to increase sparsity by adapting local cosine bases to the expected output of a single source instead of to the input mixtures. The heuristically derived cost function maximises the energy of the transform coefficients associated with a particular direction. Experiments on a mixture of four musical instruments are performed, and results are compared. It is shown that local cosine bases can give better results than fixed-basis representations.

1. INTRODUCTION

Blind source separation is a broad term which describes a set of techniques which aim to estimate individual sources from a number of observed mixtures of those source signals. Cases in which the number of mixtures is greater than, or equal to, the number of sources are called (over-)determined. These cases have been well studied, commonly through the application of independent component analysis (ICA) [5].

In contrast to the overdetermined case, underdetermined blind source separation considers cases in which there are more sources than mixtures.
In this work, we deal with underdetermined, instantaneous, two-channel mixtures of $n > 2$ time-domain audio sources:

\[
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
=
\begin{bmatrix} a_{11} & \cdots & a_{1n} \\ a_{21} & \cdots & a_{2n} \end{bmatrix}
\begin{bmatrix} s_1 \\ \vdots \\ s_n \end{bmatrix}
\tag{1}
\]

where $s_j$ is the $j$th source, $x_i$ is the $i$th mixture, $a_{ij}$ is the positive real amplitude (mixing parameter) of the $j$th source in the $i$th mixture (observation), and $1 \le j \le n$ and $i = 1, 2$. A mixture model given by Equation 1 may represent, for example, a music signal with a conventional “pan-potted stereo” mixing method. Indeed, our experiments will concentrate on mixtures in which each source is a musical instrument.

Andrew Nesbit is supported by a research grant from the Semantic Interaction with Music Audio Contents (SIMAC) project (EU-FP6-IST-507142), and by the Department of Electronic Engineering, Queen Mary, University of London.

The blind source separation problem may be split conceptually into two successive subproblems [10]. Identification is the first, and involves determining the mixing parameters $a_{ij}$. Once the mixing parameters are known, the second subproblem, filtering, involves separating each source $s_j$ from the mixtures to yield an estimated source $\hat{s}_j$. The degenerate unmixing estimation technique (DUET) [11] provides an example of this partitioning into subproblems: the mixing parameters are identified by constructing a histogram from which the values may be read. Once this has been done, the sources are estimated by time-frequency masking (Section 2). DUET is one method which may be applied to mixtures in the form of Equation 1. (DUET was originally developed for blind source separation of anechoic mixtures, in which the mixture may include relative delays as well as relative amplitude gains. Instantaneous mixtures are a special case of anechoic mixtures, obtained by setting all relative delays to zero, and so DUET may be used as stated.)

The current paper concentrates on the filtering phase.
We assume that the mixing parameters are known or have been estimated. As such, these methods are equally applicable to other non-blind scenarios, in which the mixing parameters are known.

Section 2 describes filtering by time-frequency masking. Subsequently, Section 3 describes computationally efficient methods for adapting time-frequency representations to better match the time-varying signal characteristics. These methods apply the best basis algorithm [3] to a tree of local cosine bases. In Section 4, we compare and contrast different techniques and representations: the short-time Fourier transform (STFT, which lies at the heart of the filtering stage in DUET), the modified discrete cosine transform (MDCT), a best local cosine basis derived from a mixture of the sources, and a best local cosine basis which sparsifies the representation at the output of the filtering process.

2. TIME-FREQUENCY MASKING

Consider a real- or complex-valued linear transform $T$ applied to the mixtures $x_1$ and $x_2$ in Equation 1. This gives transformed mixtures $\tilde{x}_1 = T x_1$ and $\tilde{x}_2 = T x_2$ with the same mixing structure as Equation 1. A sparse transform has most coefficients very close to zero and only a few large coefficients. Such a transform represents the mixtures in the desired way, such that the sources have (approximately) disjoint support in the transform domain.
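As an illustrative sketch (not part of the experiments reported here), the following Python fragment checks numerically that a linear transform preserves the mixing structure of Equation 1, and that sources with disjoint support in the transform domain expose their mixing gains coefficient by coefficient. The gains in `A`, the signal length `N`, the toy cosine sources, the choice of an orthonormal DCT-II as a stand-in for $T$, and the helper names `mix` and `dct2` are all assumptions made for illustration.

```python
import math

def mix(A, signals):
    """Instantaneous mixing as in Equation 1: x_i[t] = sum_j a_ij * s_j[t]."""
    length = len(signals[0])
    return [[sum(a * s[t] for a, s in zip(row, signals)) for t in range(length)]
            for row in A]

def dct2(x):
    """Orthonormal DCT-II, used here only as an example of a linear transform T."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n)
                               for t in range(n)))
    return out

# Hypothetical mixing matrix: 2 mixtures, n = 3 sources, positive real gains.
A = [[0.9, 0.6, 0.2],
     [0.3, 0.7, 0.9]]
N = 16

# Toy sources: each is a single DCT-II basis function (indices 2, 5 and 11),
# so each source has exactly one significant transform coefficient.
sources = [[math.cos(math.pi * (t + 0.5) * k / N) for t in range(N)]
           for k in (2, 5, 11)]

x = mix(A, sources)                        # time-domain mixtures x_1, x_2
x_tilde = [dct2(xi) for xi in x]           # transformed mixtures T x_i
s_tilde = [dct2(sj) for sj in sources]     # transformed sources T s_j
x_tilde_from_sources = mix(A, s_tilde)     # mixing applied in the transform domain

# Linearity: T(A s) and A (T s) agree up to rounding error.
max_diff = max(abs(p - q)
               for row_a, row_b in zip(x_tilde, x_tilde_from_sources)
               for p, q in zip(row_a, row_b))

# Disjoint support: index of the dominant coefficient of each source.
support = [max(range(N), key=lambda k: abs(st[k])) for st in s_tilde]

# At a coefficient where only source 1 is active, the ratio of the two
# transformed mixtures equals the ratio of that source's mixing gains.
ratio = x_tilde[0][support[0]] / x_tilde[1][support[0]]
```

In this toy case the sources are exactly disjoint in the transform domain, so `ratio` recovers $a_{11}/a_{21}$. Real audio sources are only approximately disjoint, which is why the sparsity of the chosen transform matters.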