The 21st International Conference on Auditory Display (ICAD 2015), July 8–10, 2015, Graz, Austria

THE LOUD BIRD DOESN’T (ALWAYS) GET THE WORM: WHY COMPUTATIONAL SALIENCE ALSO NEEDS BRIGHTNESS AND TEMPO

Francesco Tordini*, Albert S. Bregman, and Jeremy R. Cooperstock
McGill University, Montréal, QC, Canada

* Corresponding author: tord@cim.mcgill.ca. FT is with the Department of Electrical and Computer Engineering and the Centre for Interdisciplinary Research in Music, Media and Technology (CIRMMT) at McGill University, www.cirmmt.org.

This work is licensed under a Creative Commons Attribution Non Commercial 4.0 International License. The full terms of the License are available at http://creativecommons.org/licenses/by-nc/4.0

ABSTRACT

Salience shapes the involuntary perception of a sound scene into foreground and background. A computational model of salience would provide a strong perceptual baseline for the sonification designer. However, there is a lack of ground truth against which to evaluate proposed models and to measure their performance with respect to human perception. This paper describes three contributions. First, we introduce a behavioral definition of salience and describe an experiment, based on that definition, that tests a corpus of natural communication sounds. Our results suggest that salience is well described by three perceptual dimensions: not only loudness, but also tempo and brightness. Second, we extract the most significant acoustical features and analyze their relation to salience as measured by our ground truth. The context effects emerging from our analysis confirm the difference between salience and novelty. Finally, we suggest some necessary characteristics of a computational salience model based on the analyzed features.

1. INTRODUCTION

The design of auditory displays, such as warning systems and mobile assistive technologies, must deal with information delivery using sound, management of attention, and salience. Our long-term objective is to create a tool that assists in sound scene design by predicting salience.

The salience of a sound can be defined as its prominence relative to other sounds or, more generally, with respect to a background. Although the distinction between salience and attention is debated, it is well accepted that salience represents “bottom-up” processes while attention deals with “top-down”, task-driven ones.

Sonification is a subtype of auditory display that uses non-speech audio to present and represent information [1, 2]. For an effective sonification, it is necessary to predict the salience of the sounds that will be used. This is because bottom-up mechanisms, including salience, shape the listener’s involuntary organization of the sounds generating the scene [3].

To understand the effects of salience on scene perception, we need a computational model that maps a set of acoustical features to the perceived salience of a sound. Doing so presents two important challenges: the difficulty of gathering perceptual data (our ground truth), and the selection of the features to be used for salience prediction.

The ground truth has to be collected using behavioral experiments that allow labeling and ranking of a set of sounds based on their perceived salience.

With respect to the second challenge, there is a possibly infinite set of acoustic and perceptual features from which we might choose. Therefore, the ability to predict the salience of a sound using a reduced set of such features is highly desirable.
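As a concrete illustration of such a reduced feature set, the sketch below computes crude acoustic correlates of the three dimensions identified above (loudness, brightness, and tempo) for a single sound file. The choice of proxies (RMS energy for loudness, spectral centroid for brightness, onset rate for tempo) and the use of the librosa library are illustrative assumptions made for this sketch, not the features or implementation reported in this paper.

```python
import numpy as np
import librosa  # assumed available; not used in the original study


def crude_salience_correlates(path):
    """Rough per-sound proxies for loudness, brightness, and tempo."""
    y, sr = librosa.load(path, sr=None, mono=True)

    # Loudness proxy: mean RMS energy across analysis frames.
    loudness = float(np.mean(librosa.feature.rms(y=y)))

    # Brightness proxy: mean spectral centroid in Hz.
    brightness = float(np.mean(librosa.feature.spectral_centroid(y=y, sr=sr)))

    # Tempo proxy: detected onset events per second.
    onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
    tempo = len(onsets) / librosa.get_duration(y=y, sr=sr)

    return {"loudness": loudness, "brightness": brightness, "tempo": tempo}
```

A model of the kind discussed here would map such per-sound descriptors, gathered over a corpus, to behaviorally measured salience ranks.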
This work addresses both issues, first through an experimental paradigm that extracts ground truth, and second by using those experimental results to select features. These features represent the building blocks for our computational model of salience.

2. RELATED WORK

2.1. Salience and sonification

Sonification implicitly deals with salience and the management of attention in its sound design principles and guidelines (see, for example, Hunt et al. [4] or Bakker et al. [5]).

The main themes of the research agenda presented in the Sonification Report [1] show very little need for modification after almost two decades of research. Kramer et al. [1] defined sonification as the “transformation of data relations into perceived relations in an acoustic signal for the purposes of facilitating communication or interpretation”. The challenges behind the words “relation” and “perceived” used therein still deserve attention from the research community. Indeed, the complexity and the importance of taking into account the perceptual and cognitive dimensions when designing sonification systems are well documented [6, 7].

Modern sonification calls for the exploration of natural sounds as a complement, or alternative, to metaphoric, iconic ones, and for designs with “sourcy” environments in which real, dynamic sounds are not presented in isolation. The use of natural, environmental sounds is especially interesting when generating immersive, continuous soundscapes. The sonification of continuous data needs an auditory display that can be easily distinguished from the background when necessary, but can also be allowed to fade out of attention, and not be annoying or intrusive when not desired [8, 9]. Iconic, symbolic sounds are often perceived as artificial, and their acceptability under prolonged listening conditions is the result of very careful sound design. Natural