Acoustic distinctions between speech and singing: Is singing acoustically more stable than speech?

Beatriz Raposo de Medeiros 1, João Paulo Cabral 2

1 University of São Paulo, Brazil
2 Trinity College Dublin, Ireland

biarm@usp.br, cabralj@scss.tcd.ie

Abstract

In this paper we study how spoken and sung versions of the same text differ in terms of variability in duration and pitch. These two modalities are usually studied separately, and few works in the literature report comparisons of their acoustic properties. In this work, recordings of both speech and singing of Brazilian Portuguese popular songs were made. Variability was then measured by statistical analysis of the fundamental frequency and speech rate, specifically their mean and variance. In a first study this was done at the syllable and sentence levels, and later at the phone level for further analysis. In general, results show that speech and singing variability cannot be differentiated in terms of the variance. We expected different results because singing is more constrained than speech both in terms of pitch (small variation within the note) and duration (metrical constraint). It seems that the results of higher pitch stability for singing reported in the literature cannot be generalised, particularly for the popular genre, in which there is a prosodic proximity between singing and speech. These findings also motivate the analysis of other aspects of dynamic pitch and duration to better understand the prosodic differences between the two modalities.

Index Terms: speech and singing, prosody, acoustic analysis

1. Introduction

In general, humans can perceptually distinguish speech from singing, and it seems we do this in a very intuitive way. However, it is also true that there are spontaneous spoken utterances that can sound as if they were sung.
At first glance it seems difficult to unravel the phenomena that are common to both song and speech, but a reasonable number of scholars have already compared speech and music regarding memory, melody perception, intelligibility, rhythm and intonation [1, 2, 3, 4, 5, 6]. Although these studies reveal differences between musical and speech elements, they shed light on the interplay between pitch and rhythm. The illusory transformation from speech to song in [3] demonstrates the interesting similarities between them. That study describes two perceptual experiments in which the repetition of the same utterances, as well as the insertion of modified utterances, made listeners judge the spoken phrases as a song.

The fundamental frequency (F0) is an important acoustic parameter that differentiates speech and singing [3, 7]. A singer sustains F0 at an approximately constant value over relatively long durations, such as during a musical note, and notes tend to follow each other in a controlled way. In contrast, a speaker generally produces more rapid and frequent F0 transitions. Some authors refer to F0 stability in the comparison between speech and song, e.g. [8, 9, 10, 11]. The definition of stability we use in this work is aligned with the Tonal Hypothesis [9], that is, two tonal properties lead to a perceptual shift from speech to song: more stable tone targets and musical scalar intervals. Those works propose that song has a more isochronous rhythm and greater F0 stability within each syllable, which is in accordance with the assumption that a musical note approximately matches the syllable unit. Their experiments are mainly or solely based on speech stimuli, for example by using the speech-to-song illusion demonstrated in [3]. In our work, the aim is to distinguish speech from singing, and we directly compare stimuli obtained from recordings of these two modalities.
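
As an illustration of the within-syllable F0 stability discussed above, the following is a minimal sketch and not the procedure used in the cited works or in this study. It assumes an F0 contour sampled at known times and a list of syllable boundaries; the function names, the conversion to semitones, the 100 Hz reference and the synthetic contours are all assumptions made for the example.

```python
import numpy as np


def hz_to_semitones(f0_hz, ref_hz=100.0):
    """Convert F0 in Hz to semitones relative to ref_hz (assumed reference)."""
    return 12.0 * np.log2(np.asarray(f0_hz, dtype=float) / ref_hz)


def syllable_f0_stats(times, f0_hz, syllables):
    """Return one (mean, variance) pair in semitones per syllable.

    times     : frame times in seconds
    f0_hz     : F0 per frame in Hz; values <= 0 are treated as unvoiced
    syllables : list of (start, end) boundaries in seconds (hypothetical input)
    """
    times = np.asarray(times, dtype=float)
    f0_hz = np.asarray(f0_hz, dtype=float)
    stats = []
    for start, end in syllables:
        # Keep voiced frames that fall inside the syllable interval.
        mask = (times >= start) & (times < end) & (f0_hz > 0)
        if mask.sum() < 2:
            # Too few voiced frames to estimate a variance.
            stats.append((float("nan"), float("nan")))
            continue
        st = hz_to_semitones(f0_hz[mask])
        # Sample variance (ddof=1) of F0 within the syllable.
        stats.append((float(st.mean()), float(st.var(ddof=1))))
    return stats


# Synthetic example: one 0.4 s syllable, 10 ms frames.
frame_times = np.arange(40) * 0.01
sung_f0 = np.full(40, 220.0)               # flat contour, like a held note
spoken_f0 = np.linspace(180.0, 240.0, 40)  # gliding contour

sung_stats = syllable_f0_stats(frame_times, sung_f0, [(0.0, 0.4)])
spoken_stats = syllable_f0_stats(frame_times, spoken_f0, [(0.0, 0.4)])
```

In this toy case the flat contour yields a within-syllable variance of essentially zero, while the gliding contour yields a clearly larger one. Real recordings would also need F0 extraction and syllable segmentation, which are outside the scope of this sketch.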
Evidence such as that presented in [3, 8, 9] indicates that low-level acoustic characteristics need to be taken into account to explain the intimate relations between music and speech, as well as their differences. The aim of the present work is then to answer a specific question: how can speech be distinguished from singing, taking into account acoustic parameters related to duration and pitch? The hypothesis is that, for both prosodic aspects, singing will show greater stability than speech.

We chose statistical measures commonly used to analyse the stability of F0 and duration, namely their mean and variance. Our analysis of stability also depends on the time window chosen, because acoustic stability is directly related to the variation of acoustic parameters over a certain time interval. For instance, an entire song is likely to be perceived as singing and hardly as speech. But we expect that the ambiguous behaviour at issue can be better captured in shorter time intervals. Our stimuli preparation is similar to that in [8]: first, sentences were selected as linguistic units, and second, syllables were chosen as another unit in which F0 and rhythm were analysed. Following this study, we also conducted the measurements at the phone level for all the vowels. The reason for this was that we observed that dynamic acoustic effects may occur within the syllable (between phonetic or even musical elements) for singing, leading to results that would obscure the hypothesised F0 stability.

The remainder of this paper is structured as follows. Section 2 describes the design of the experiment, mainly how the spoken and sung texts (which we refer to from now on as songs) were chosen, and how it was carried out. Section 3 presents results obtained for both duration and pitch aspects. Finally, Sections 4 and 5 are dedicated to discussion and conclusion, respectively.

2. The Experiment

2.1. Selection of Songs

The texts chosen to be spoken and sung were those of four songs that constitute sung versions of excerpts from the literary work Macunaima. Due to its importance in modern Brazilian literature, the book was followed by its spoken versions in a film and a play, around the 1960s, whose titles were the same as that of the

9th International Conference on Speech Prosody 2018, 13-16 June 2018, Poznań, Poland, p. 542, 10.21437/SpeechProsody.2018-110