Audio Engineering Society Convention Paper

Presented at the 120th Convention, 2006 May 20–23, Paris, France

This convention paper has been reproduced from the author's advance manuscript, without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.

Signal Analysis Using the Complex Spectral Phase Evolution (CSPE) Method

Kevin M. Short (1) and Ricardo A. Garcia (2)

(1) Chaoticom Technologies, Andover, MA 01810, USA (kevin@chaoticom.com)
(2) Chaoticom Technologies, Andover, MA 01810, USA (rago@chaoticom.com)

ABSTRACT

The Complex Spectral Phase Evolution (CSPE) method is introduced as a tool to analyze and detect the presence of short-term stable sinusoidal components in an audio signal. The method provides for super-resolution of frequencies by examining the evolution of the phase of the complex signal spectrum over time-shifted windows. It is shown that this analysis, when applied to a sinusoidal signal component, allows for the resolution of the true signal frequency with orders of magnitude greater accuracy than the DFT. Further, this frequency estimate is independent of the frequency bin and can be estimated from "leakage" bins far from spectral peaks. The method is robust in the presence of noise or nearby signal components, and is a fundamental tool in the front-end processing for the KOZ compression technology.

1. INTRODUCTION

Many audio applications require signals to be represented as a discrete sum of sinusoidal components.
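The phase-evolution idea described in the abstract can be sketched numerically. In the fragment below (a minimal illustration only; the sample rate, window length, test frequency, and use of a Hann window are assumptions made for this sketch, not values from the paper), two DFTs are taken over windows offset by one sample. For a complex exponential e^{jwn}, the shifted-window spectrum equals the original spectrum multiplied by e^{jw} in every bin, so the bin-wise phase difference recovers the frequency on a much finer grid than the DFT's resolution of (sampling rate)/N:

```python
import numpy as np

# Illustrative parameters (assumptions for this sketch, not from the paper)
fs = 8000.0          # sample rate in Hz
N = 1024             # analysis window length
f_true = 1000.3      # test frequency, deliberately between DFT bins

n = np.arange(N + 1)                       # one extra sample for the shifted window
x = np.cos(2.0 * np.pi * f_true / fs * n)
w = np.hanning(N)                          # window to suppress cross-leakage

# Conventional DFT estimate: pick the peak bin, quantized to fs/N steps
F0 = np.fft.fft(w * x[:N])
k = int(np.argmax(np.abs(F0[: N // 2])))
f_dft = k * fs / N

# CSPE: a second transform over the same data shifted by one sample.
# For a complex exponential, F1[k] = e^{jw} * F0[k] in every bin, so the
# angle of F1[k] * conj(F0[k]) is the per-sample phase advance w.
F1 = np.fft.fft(w * x[1 : N + 1])
omega = np.angle(F1[k] * np.conj(F0[k]))   # radians per sample
f_cspe = omega * fs / (2.0 * np.pi)

print(f"fs/N resolution : {fs / N:.4f} Hz")
print(f"DFT peak bin    : {f_dft:.3f} Hz (error {abs(f_dft - f_true):.3f} Hz)")
print(f"CSPE estimate   : {f_cspe:.4f} Hz (error {abs(f_cspe - f_true):.6f} Hz)")
```

For a real sinusoid the relation holds only approximately, since the positive- and negative-frequency components interact, but windowing keeps that cross-leakage small: here the peak-bin estimate is quantized to multiples of fs/N = 7.8125 Hz, while the CSPE estimate lands within a tiny fraction of a bin of the true frequency.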
In most transform-based processing techniques, audio is decomposed by applying the Discrete Fourier Transform (DFT) or the Fast Fourier Transform (FFT) to windows of N data samples. While these transforms are one-to-one on discretely sampled data, they suffer from several limitations that add to the complexity of representing the audio signal as a sum of sinusoidal components. One such limitation is that the basis sinusoids all complete an integer number of periods over the N samples, whereas real signal components are under no such restriction. As a consequence, the DFT and FFT have a frequency resolution limited to the sampling rate divided by N. In the particular application of interest in this paper, audio compression, a more efficient representation is desired. In compression applications, the front-end processing often requires modeling the sound as a decomposition into sinusoids plus transients plus noise [1][2]. Most of these methods depend on a mechanism for estimating a