TOP-DOWN STRATEGIES IN PARAMETER SELECTION OF SINUSOIDAL MODELING OF AUDIO

Toni Hirvonen and Athanasios Mouchtaris

Department of Computer Science, University of Crete, and Institute of Computer Science, Foundation for Research and Technology - Hellas (FORTH-ICS), Heraklion, Crete, Greece

ABSTRACT

Sinusoidal modeling of audio requires the model parameters to be selected by analyzing the original signal spectrum. This paper proposes two improvements in sinusoidal selection by considering how psychoacoustic masking curves can be calculated using a top-down strategy in certain situations. First, a non-iterative component selection method, to be used in combination with an added residual signal, is presented. Tests indicate a computational gain and a quality increase when the method is used with a noise-synthesized residual. Second, the estimation of the masking curve in binaural listening when signals are panned is considered. Tests show that knowledge of the degree of panning is beneficial when heavy panning is applied to simultaneously rendered audio object signals.

Index Terms: audio coding, sinusoidal modeling, psychoacoustic masking

1. INTRODUCTION

Sinusoidal modeling [1] of audio is one of the most popular parametric audio modeling methods, since it can represent an audio signal with good quality by modeling only a relatively small number of spectral components. Some types of sounds cannot be accurately represented by the sinusoidal model. For these cases, an additional component (the residual part) is included, which models the sinusoidal error signal, i.e. the difference between the actual signal and its modeled version [2].

In sinusoidal modeling, energetic masking (due to the human auditory system) is utilized to determine the frequencies of the most perceptually important components [3]. This is usually done in an iterative manner: after selecting one component, the residual magnitude spectrum and the masking curve are updated.
At each step, the component frequency that minimizes a perceptual distortion measure is selected. The remaining model parameters (sinusoidal amplitudes and phases) are estimated from the original signal after the frequency selection.

State-of-the-art approaches for sinusoidal selection such as [3] can be thought of as implementing a bottom-up approach, in which no information about the signal reconstruction model or the playback conditions is exploited. At each step, the method maximizes the energy of the spectrum that is covered by the masker of the new component. No additional criteria, such as naturalness due to the use of a residual model, are considered in this process.

This work has been funded in part by the Marie Curie TOK “ASPIRE” grant, and in part by the PEOPLE-IAPP “AVID-MODE” grant, within the 6th and 7th European Community Framework Programs respectively.

The purpose of this paper is to refine the sinusoidal parameter selection process so that it better fits certain applications, in a more top-down manner. The term “top-down” in this paper implies that, instead of using the processing tools independently, we take a holistic view of the reproduction process and conditions, which can be used to alter the methods in a way that is beneficial for these particular conditions. This paper proposes two contributions regarding the frequency selection in the sinusoidal model: (a) a non-iterative process for estimating the perceptually important frequency components, and (b) masking curve estimation when multiple signals are to be panned before reproduction.

The first contribution, (a), is useful when a residual signal is used. In this case, the synthesized energy is close to that of the original signal. Consequently, we show that our non-iterative method for sinusoidal component selection offers improved sound quality compared to current iterative methods, in addition to better computational efficiency.
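For reference, the iterative, bottom-up selection loop described in this introduction can be sketched as follows. This is only a minimal illustration under stated assumptions, not the method of [3]: the masking curve is a toy triangular spreading function over FFT bins, the perceptual distortion measure is replaced by simply picking the bin with the largest level above the current masking curve, and the residual spectrum update is approximated by excluding bins that have already been chosen. The names toy_masking_curve and iterative_select, and all parameter values, are hypothetical.

```python
import numpy as np

def toy_masking_curve(mag_db, selected, floor_db=-60.0,
                      offset_db=10.0, slope_db_per_bin=3.0):
    """Toy masking threshold in dB (an assumption, NOT the model of [3]).

    Starts from a flat floor and adds, for each selected bin, a triangular
    masker that peaks offset_db below the component level and decays
    linearly with the distance in bins.
    """
    bins = np.arange(len(mag_db))
    curve = np.full(len(mag_db), floor_db)
    for k in selected:
        masker = mag_db[k] - offset_db - slope_db_per_bin * np.abs(bins - k)
        curve = np.maximum(curve, masker)
    return curve

def iterative_select(frame, n_components):
    """Greedy selection: pick one bin, update the masking curve, repeat."""
    windowed = frame * np.hanning(len(frame))
    mag_db = 20.0 * np.log10(np.abs(np.fft.rfft(windowed)) + 1e-12)
    selected = []
    taken = np.zeros(len(mag_db), dtype=bool)
    for _ in range(n_components):
        # The masking curve is recomputed after every selection.
        curve = toy_masking_curve(mag_db, selected)
        audible = mag_db - curve      # level above the current threshold
        audible[taken] = -np.inf      # never pick the same bin twice
        k = int(np.argmax(audible))
        if audible[k] <= 0.0:         # everything left is masked
            break
        selected.append(k)
        taken[k] = True
    return selected

# Example frame: two bin-centered sinusoids at bins 50 and 200.
n = 1024
t = np.arange(n)
frame = np.cos(2 * np.pi * 50 * t / n) + 0.5 * np.cos(2 * np.pi * 200 * t / n)
bins_found = iterative_select(frame, 2)
```

The point of the sketch is the loop structure: the masking curve has to be recomputed after every selected component, which is the per-component cost that a non-iterative selection avoids.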
The second contribution, (b), indicates how the sinusoidal selection must be implemented in cases where the audio signals are modeled before mixing occurs. The significance of this result relates to the current efforts in the upcoming MPEG Spatial Audio Object Coding (SAOC) standard [4] and to the possibility of applying the sinusoidal model in this context (see for example [5]). In SAOC, the goal is to encode multiple audio signals before they are mixed into a stereo or multichannel reproduction. This offers the advantage of mixing at the decoder, which is expected to enable a variety of interactive audio applications.

2. NON-ITERATIVE COMPONENT SELECTION

This section discusses an improved method for component selection in the sinusoidal model, for the case where an additive residual signal is used. Unlike Section 3, this section considers the modeling of single-channel audio.

2.1. Psychoacoustic Sinusoidal Matching Pursuit

Current state-of-the-art methods employ perceptual matching pursuit algorithms to determine the sinusoidal parameters of each frame. In [3], an improved frequency masking model was combined with Psychoacoustical Matching Pursuit (PAMP). At each iteration i, PAMP minimizes the perceptual distortion measure D

978-1-4244-4296-6/10/$25.00 ©2010 IEEE, ICASSP 2010