2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, October 21-24, 2007, New Paltz, NY

USING STEREO INFORMATION FOR INSTRUMENT IDENTIFICATION IN POLYPHONIC MIXTURES

David Sodoyer 1, Pierre Leveau 1,2, Laurent Daudet 1

1 University Pierre et Marie Curie - Paris 6, Institut Jean Le Rond D'Alembert, LAM team, 11, rue de Lourmel, 75015 Paris, France, [name]@lam.jussieu.fr
2 GET-Télécom Paris (ENST), TSI department, 46, rue Dareau, 75014 Paris, France, [firstname].[lastname]@enst.fr

ABSTRACT

This paper discusses the localization of music instruments in the stereo space. The signal, composed of two channels, is decomposed into a linear combination of Stereo Instrument-Specific Harmonic atoms, which model the harmonic structure of instrument notes as a whole and whose individual angles give clues about the real angle of the sources. To obtain such decompositions, a Stereo Matching Pursuit algorithm has been implemented, with a phase adaptation for each signal channel. This decomposition gives neat source localizations for instantaneous mixes, and the extension to realistic convolutive mixes seems possible with adequate post-processing.

1. INTRODUCTION

A large number of digital sound archives are composed of stereo recordings: Music Information Retrieval (MIR) algorithms should exploit the additional information given by the spatial separation of sources to better identify them, as the human ear does. However, current systems for MIR make little use of spatial attributes. A few studies can nevertheless be mentioned: Vincent [1] identifies instruments, notes and localizations using instrument models in a Bayesian framework, and Sakuraba [2] takes the level differences between channels as a feature to separate melodic lines in polyphonic music. Studies in other domains also use the stereo aspect, for example in speech recognition [3].

On non-instantaneous mixes (for instance on live acoustic recordings using a stereo pair), extracting spatial information without priors on the sources can only lead to limited results, mostly due to reverberation effects. The goal of this paper is to explore spatial attributes of stereo files using explicit source models. This provides, for a limited class of harmonic instruments, a number of representations that can be considered as mid-level between the plain signal features and the high-level features (symbolic or semantic). We emphasize here its use for instrument recognition in polyphonic mixtures, although we believe that similar methods can be useful for a large number of classical MIR tasks.

The signal models used here have already been employed in the mono-channel case [4, 5]: for each considered instrument, harmonic series (relative amplitudes of partials) are learned on isolated sources, at every pitch and for a number of different expressive playing techniques. These amplitudes are then considered as parameters of short waveforms, called atoms. The problem is then restated as follows: given a recording, how can we identify the most likely set of harmonic groups that explain the mixture? This general sparse identification problem cannot be tackled directly with exhaustive searches, for combinatorial reasons; we therefore make use of sub-optimal but tractable methods such as the Matching Pursuit algorithm [6]. On mono files, we have shown that the decomposition of music signals with these atoms provides relevant mid-level representations, giving access to the source identity in the time-pitch domain.
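To make the decomposition idea concrete, the short Python sketch below runs a toy stereo Matching Pursuit on an instantaneous (pan-pot) mix and reads a pan angle off each selected atom. It is a minimal illustration, not the system described in this paper: the function names (harmonic_atom, stereo_matching_pursuit) and all numerical values are arbitrary, the atoms use a generic 1/k partial roll-off instead of the learned instrument-specific amplitude profiles, and the per-channel phase adaptation and inter-channel delay are omitted.

import numpy as np

def harmonic_atom(f0, n_partials, length, sr):
    """Unit-norm windowed harmonic atom at fundamental f0 (Hz).
    A generic 1/k partial roll-off stands in for the learned,
    instrument-specific amplitude profiles."""
    t = np.arange(length) / sr
    partials = sum((1.0 / k) * np.cos(2 * np.pi * k * f0 * t)
                   for k in range(1, n_partials + 1))
    atom = partials * np.hanning(length)
    return atom / np.linalg.norm(atom)

def stereo_matching_pursuit(x_l, x_r, atoms, n_iter):
    """Greedy stereo decomposition: at each step, pick the atom with the
    largest summed left+right correlation energy, subtract its contribution
    from both channels, and estimate its pan angle as arctan2(<x_r, a>, <x_l, a>)."""
    res_l, res_r = x_l.copy(), x_r.copy()
    picks = []
    for _ in range(n_iter):
        corr_l = np.array([res_l @ a for a in atoms])
        corr_r = np.array([res_r @ a for a in atoms])
        k = int(np.argmax(corr_l ** 2 + corr_r ** 2))
        theta = np.arctan2(corr_r[k], corr_l[k])   # pan-pot angle estimate (rad)
        res_l -= corr_l[k] * atoms[k]
        res_r -= corr_r[k] * atoms[k]
        picks.append((k, theta))
    return picks

# Toy example: two notes panned at different angles in an instantaneous mix.
sr, length = 44100, 4096
f0s = (220.0, 261.6, 311.1)
atoms = [harmonic_atom(f0, 8, length, sr) for f0 in f0s]
theta_a, theta_b = 0.2, 1.1                        # ground-truth pan angles (rad)
x_l = np.cos(theta_a) * atoms[0] + np.cos(theta_b) * atoms[2]
x_r = np.sin(theta_a) * atoms[0] + np.sin(theta_b) * atoms[2]
for k, theta in stereo_matching_pursuit(x_l, x_r, atoms, n_iter=2):
    print(f"f0 = {f0s[k]} Hz, estimated pan angle ≈ {theta:.2f} rad")
# Prints the two mixed atoms with angles close to 0.2 and 1.1 rad.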
This mono-channel system is, however, prone to errors that may have unfortunate consequences for subsequent processing. An interesting feature of the Matching Pursuit algorithm is that it has a straightforward extension to multichannel signals [7]. Performing a similar analysis on stereo channels now gives us access to another dimension, namely the angular position of each extracted atom. In this paper, we show how these time-pitch-angle representations can help us understand more accurately the content of a recording. One has to distinguish between recordings made with instantaneous mixtures ("pan-pot") of separated tracks, and real acoustic recordings using a stereo pair, the latter being far more difficult to tackle. However, the mid-level representation can be seen as a first step towards source separation.

This paper is organized as follows. The problem of music instrument localization is first stated in section 2; then the signal model, the decomposition algorithm and the learning of atom parameters are detailed in sections 3, 4 and 5, respectively. Then, in section 6, the whole system is tested on stereo mixes, and we conclude by drawing some perspectives for instrument characterization tasks.

2. PROBLEM DEFINITION

In the anechoic case, the stereo recording signal of one instrument x_m(t) can be modeled as a pair of signals

x_{st}(t) = [\cos(\theta)\, x_m(t) \quad \sin(\theta)\, x_m(t - \tau)]^T \qquad (1)

where θ is a pan-pot parameter locating the source in the stereo space and τ is a delay parameter between channels. In the case of a real acoustic recording, the stereo signal is the result of the convolution of the source signal with source-microphone-specific impulse responses:

x_{st}(t) = [\psi_l(t) * x_m(t) \quad \psi_r(t) * x_m(t - \tau)]^T \qquad (2)

Following room acoustics considerations [8], these impulse responses can be considered as the result of the contributions of a large number I of virtual sources, each having a pan-pot parameter θ_i and emitting at a delay t_i after the original emission. Thus we can rewrite the impulse response of the stereo filter h_{st}(t) as