Limits on the Application of Statistical Correlations to Continuous Response Data

Finn Upham
Music and Audio Research Lab, Department of Music and Performing Arts Professions, Steinhardt School of Culture, Education, and Human Development, New York University, USA
finn@nyu.edu

ABSTRACT

How can we compare different listeners' experiences of the same music? For decades, experimenters have collected continuous ratings of tension and emotion to capture the moment-by-moment experiences of music listeners. Over that time, Pearson correlations have routinely been applied to evaluate the similarity between response A and response B, between the time series averages of responses, and between responses and continuous descriptors of the stimulating music. Some researchers have criticized the misapplication and misinterpretation of this class of statistics, but alternatives have not gained wide acceptance. This paper looks critically at the applicability of correlations to continuous responses to music, the assumptions required to estimate their significance, and what is left of the responses when these assumptions are satisfied. This paper also explores an alternative measure of cohesiveness between responses to the same music, and discusses how it can be employed as a measure of reliability and similarity with empirical estimates of significance.

I. INTRODUCTION

Continuous ratings of music perception and experience are common measures of the dynamics of a listener’s response. Using some kind of digitally sampled interface, participants report how they perceive or experience the music being presented on scales such as aesthetic experience, tension, and perceived or experienced emotion. Each response forms a time series sampled between 1 and 10 times a second for the duration of the musical stimulus. Although such responses are collected by dozens of researchers around the world, there is little consensus on appropriate techniques for evaluating similarity between responses.
Pearson Product Moment Correlations [PPMC] have been naively applied to these time series since the late 1980s in an attempt to capture the reliability of ratings on repeated tasks [Gregory, 1995]. Correlations have since been employed to compare different participants’ responses [Krumhansl, 1996], to compare sections of responses [Livingstone et al., 2011], to compare responses with continuous representations of the music, and to assess lags in responses via cross-correlation [Lucas et al., 2010].

Outside of music cognition work, it is commonly known that correlations cannot be applied blindly to time series data. In 2002, Schubert published an early criticism of the common practice, calling out the problem of serial correlation and proposing the analysis of difference data, or rating changes, to reduce the inflation of r-values. Other researchers have attempted to improve matters by using nonparametric correlation measures such as Spearman [Vines et al., 2006], by downsampling responses to their average Nyquist frequency [Chapin et al., 2010], and by using autocorrelation models of the kind commonly applied to economic time series [Dean and Bailes, 2010]. Despite these warnings and attempts at finding alternatives, researchers have continued to publish analyses of continuous responses using inappropriately applied correlations and estimates of significance. This paper attempts to present in more detail the limits of correlations and the impact of serial correlation in the data, and to deter future abuse of this important class of calculations.

Figure 1. Example of correlation on discrete data: two listeners’ retrospective ratings of liking on 22 musical excerpts.

II. CORRELATIONS

A correlation is a standardized measure of covariance between two variables [Rodgers and Nicewander, 1988]. Consider the example shown in Figure 1 on discrete data: two subjects’ retrospective liking ratings for 22 excerpts of music.
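The definition of a correlation as a standardized covariance can be sketched numerically. The following is only a minimal illustration: the rating values and helper names are hypothetical and are not the data plotted in Figure 1. It computes the Pearson coefficient as the covariance divided by the product of the standard deviations, and checks that the same value results from the covariance of z-scored ratings and from ratings that have been shifted and rescaled.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sample_std(values):
    # Sample standard deviation (n - 1 in the denominator).
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def covariance(x, y):
    # Sample covariance of two equal-length sequences.
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def pearson_r(x, y):
    # Covariance standardized by the two standard deviations.
    return covariance(x, y) / (sample_std(x) * sample_std(y))

def zscores(values):
    # Rescale to zero mean and unit standard deviation.
    m, s = mean(values), sample_std(values)
    return [(v - m) / s for v in values]

# Hypothetical 1-7 liking ratings from two listeners (illustrative only).
listener_1 = [2, 3, 1, 4, 5, 2, 3, 6]
listener_2 = [4, 5, 3, 5, 7, 3, 6, 7]

r = pearson_r(listener_1, listener_2)

# The covariance of the z-scored ratings is the same coefficient.
r_from_z = covariance(zscores(listener_1), zscores(listener_2))

# Shifting and positively rescaling one listener's ratings leaves r unchanged.
r_rescaled = pearson_r([2 * v + 1 for v in listener_1], listener_2)

print(r, r_from_z, r_rescaled)
```

The three printed values agree up to floating-point error, which is the sense in which the coefficient ignores each listener's mean and spread.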
The top graph to the left shows the values from 1 to 7 which each listener gave to each excerpt. To the right are the distributions of each listener’s ratings. Both the bar graph and the estimated normal distribution capture the fact that, on average, subject 1 reported lower ratings than subject 2. This difference is also shown in the left-most scatterplot, as most of the excerpts fall below the diagonal. Correlations discard differences of means and variances to give conveniently interpretable standardized coefficient values. A Pearson product moment correlation between these rating values gives the same result as the Pearson correlation on the data after normalizing each set of ratings to a unitless distribution with a mean of zero and a standard deviation of one. The right-most scatterplot of Figure 1 shows the ratings standardized by rank, in which the rating value on each excerpt is replaced by its rank (or in this case its tied rank) from smallest to largest value within each subject’s distribution of ratings. This non-linear standardization of values is used to compute the non-parametric Spearman correlation, again discarding the units of both variables. A correlation coefficient calculated on these two sets of liking ratings expresses how closely the listeners’ relative