Computational analysis of Maqam music: from audio transcription to musicological analysis, everything is tightly intertwined

Olivier Lartillot
Swiss Center for Affective Sciences, University of Geneva
7, rue des Battoirs, CH-1205 Geneva

Abstract: Automated transcription of audio recordings into musical scores is a very challenging problem. Robust technological solutions are so far limited to simple cases and specific conditions, such as a focus on particular, tractable musical instruments. The traditional conception of transcription as the inference of a single layer of notes ignores one core characteristic of music: its multi-layer encapsulation of events at various scales (notes, gestures, motifs, phrases, etc.), where higher-level structures contextually guide the progressive discovery of lower-level elements. Modeling the emergence of these multiple structural layers, although it complicates the problem, is in our view the only way to obtain a robust automation of music transcription, which is modeled here as a multi-layer and recursive auditory scene analysis. In addition, culture, understood as the experience of previous similar types of music, plays another essential role in guiding the more ambiguous aspects of music understanding. A previously proposed model of the impact of culture on structural understanding, applied in particular to Arabic Maqam music, is generalized here to study the influence of such cultural knowledge on music analysis, and in particular on the lowest layers of note transcription.

Keywords: Transcription, Music analysis, Maqam

1. Introduction

The aim of music transcription is to extract elementary musical events (such as notes) from the raw audio signal and to characterize them with respect to their temporal location, duration, pitch height and dynamics, but also to organize the notes into streams related in particular to musical instruments and registers, to integrate them into an underlying metrical structure, to indicate salient motivic configurations, and so on.

Computational techniques for detecting these events follow three main strategies. The first consists in detecting saliencies in the temporal evolution of the signal energy. This method fails in general cases, quite common in music, where each note already features significant temporal modulation of energy (such as vibrato), or where series of notes are threaded into global gestures in which the transitions between notes are not articulated in terms of dynamics. An alternative consists in observing the spectral evolution in more detail, in particular by detecting significant dissimilarities between successive frames with respect to their overall spectral distributions. Yet such global frame-by-frame comparisons cannot, in general, properly discriminate between spectral discontinuities that are intrinsic to the dynamics of a single note and those that correspond to transitions between notes. A robust and general approach to note event detection requires a more careful analysis of the audio content, related in particular to the temporal evolution of pitch height and to the inference, from this continuous representation, of periods of pitch stability corresponding to the notes. This study is illustrated with a particular example of traditional Tunisian maqam music, using a
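To make the first strategy concrete, the following is a minimal pure-Python sketch (not the author's implementation) of energy-based onset detection: a short-time energy envelope followed by threshold crossing. The function names (`energy_envelope`, `onset_peaks`), the frame and hop sizes, and the synthetic test signal are all illustrative assumptions; on real recordings featuring vibrato or legato gestures, this detector exhibits exactly the failure modes described above.

```python
import math

def energy_envelope(signal, frame=256, hop=128):
    """Short-time energy: sum of squared samples per frame."""
    return [sum(x * x for x in signal[i:i + frame])
            for i in range(0, len(signal) - frame + 1, hop)]

def onset_peaks(env, threshold=0.5):
    """Frame indices where the envelope rises past a fraction of its maximum."""
    m = max(env)
    peaks, above = [], False
    for i, e in enumerate(env):
        if e > threshold * m and not above:
            peaks.append(i)
            above = True
        elif e < threshold * m:
            above = False
    return peaks

# Synthetic example: two 440 Hz tones separated by silence (8 kHz sample rate).
sr = 8000
tone = [math.sin(2 * math.pi * 440 * t / sr) for t in range(2000)]
signal = tone + [0.0] * 1000 + tone
peaks = onset_peaks(energy_envelope(signal))  # one detected onset per tone
```

On this idealized signal the two onsets are found; replacing the silence with a smooth note-to-note transition makes the envelope flat and the detector blind, which is the limitation the text points out.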
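The second strategy, frame-by-frame comparison of spectral distributions, can be sketched as a half-wave-rectified spectral flux. The plain DFT and all parameter choices below are illustrative assumptions rather than the method of any particular system; the flux peaks at the transition between two synthetic tones, but it would peak just as readily at spectral discontinuities internal to a single note, which is the ambiguity discussed above.

```python
import cmath, math

def dft_mag(frame):
    """Magnitude spectrum (first half of the bins) via a plain DFT."""
    N = len(frame)
    return [abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N)
                    for n in range(N)))
            for k in range(N // 2)]

def spectral_flux(signal, frame=64, hop=64):
    """Half-wave-rectified difference between successive magnitude spectra."""
    specs = [dft_mag(signal[i:i + frame])
             for i in range(0, len(signal) - frame + 1, hop)]
    return [sum(max(b - a, 0.0) for a, b in zip(s1, s2))
            for s1, s2 in zip(specs, specs[1:])]

# Synthetic example: a 500 Hz tone followed by a 1000 Hz tone (8 kHz rate).
sr = 8000
signal = ([math.sin(2 * math.pi * 500 * t / sr) for t in range(500)]
          + [math.sin(2 * math.pi * 1000 * t / sr) for t in range(500)])
flux = spectral_flux(signal)  # largest values at the frequency change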
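The pitch-based approach advocated above can be sketched in two steps: estimate a fundamental frequency per frame (here via a simple autocorrelation maximum, an illustrative assumption, not the model described in this paper), then segment the resulting pitch track into periods of stability. The helper names, tolerance, and synthetic two-note signal are assumptions for demonstration only.

```python
import math

def f0_autocorr(frame, sr, fmin=100, fmax=1000):
    """Fundamental frequency estimate: the lag maximizing the autocorrelation."""
    best_lag, best = None, -float("inf")
    for lag in range(int(sr / fmax), min(int(sr / fmin), len(frame) - 1)):
        r = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if r > best:
            best, best_lag = r, lag
    return sr / best_lag

def stable_segments(track, tol=0.03, min_len=3):
    """Average f0 of each run of consecutive estimates within `tol` deviation."""
    segs, start = [], 0
    for i in range(1, len(track) + 1):
        if i == len(track) or abs(track[i] - track[start]) / track[start] > tol:
            if i - start >= min_len:
                segs.append(sum(track[start:i]) / (i - start))
            start = i
    return segs

# Synthetic example: two notes (440 Hz then 550 Hz) at an 8 kHz sample rate.
sr = 8000
def tone(freq, n):
    return [math.sin(2 * math.pi * freq * t / sr) for t in range(n)]
signal = tone(440, 2000) + tone(550, 2000)
track = [f0_autocorr(signal[i:i + 400], sr)
         for i in range(0, len(signal) - 399, 400)]
segs = stable_segments(track)  # two stable-pitch segments, one per note
```

Integer-lag quantization limits the pitch precision here (440 Hz is read as roughly 444 Hz); a real system would refine the estimate by interpolation, which matters all the more for maqam repertoires whose intervals do not fit twelve-tone equal temperament.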