A Comparison between Measured and Modelled HRTFs for an Enhancement of Real-time 3D Audio Processing for Virtual Reality Environments

Alejandro Saurí Suárez, Jason-Yves Tissières, Luis S. Vieira, Reuben Hunter-McHardy, Sam K. Sernavski, Stefania Serafin

Abstract—Sound in Virtual Reality (VR) has been explored through a variety of algorithms that try to enhance the illusion of presence by improving sound localization and spatialization in the virtual environment. As new systems are developed, different models are applied, and there is still a need to evaluate and understand the main advantages of each of these approaches. In this study, a performance comparison of two methods for real-time 3D binaural sound rendering over headphones tested participant preferences and quality of presence in a VR experience. Both the mathematically modelled HRTF and the convolution-based measured HRTF from the MIT KEMAR database show a general similarity in the participants' sense of localization, depth, and presence. Nevertheless, the tests also indicate better elevation perception with the convolution-based measured HRTF. Further experiments with new tools, techniques, contexts, and guidelines are therefore required to highlight the importance of, and differences between, these two methods and other implementations.

Index Terms—3D binaural sound, HRTF, VR

1 INTRODUCTION

3D sound rendering is an important element in Virtual Reality (VR) applications. As visual feedback and narrative move towards immersive interaction, investigating how 3D audio rendering tools complement or enhance visual feedback becomes important. 3D sound adds the dimension of elevation to aural perception and the externalization of audio sources, both of which are missing in stereo or surround formats [12].

In binaural 3D audio rendering, several signal processing methods have been developed over the years by modeling and/or estimating Head-Related Transfer Functions (HRTFs). Analysis of head-related spectra, as well as their time-domain representations (called head-related impulse responses, or HRIRs), has often resulted in complex filter designs, as discussed by Jyri Huopaniemi and Matti Karjalainen [5].

As an initial assumption, the convolution process requires more computational power than the mathematical approach, which would in turn perform better when rendering a sound source in three-dimensional space. A mathematically based algorithm enables continuous interpolation over 360° of motion and an implementation based purely on mathematical operations. The convolution-based model instead requires loading a library of impulse responses measured at particular angles, which are convolved with the input signal.

Moreover, this domain has recently seen considerable development due to the popularity of VR applications, which require properly rendered 3D sound. The main actors in VR technology (Oculus Rift, HTC Vive, Sony PlayStation VR, Samsung Gear) develop their own audio engines (plug-ins) for rendering 3D sound; these, however, lack cross-platform compatibility. Additionally, the built-in 3D sound engine in existing game engines such as Unity3D is, by empirical observation, less effective at reproducing human auditory cues.

In this study, the experimental hypothesis evaluates which 3D audio rendering model performs better localization and spatialization of audio cues in virtual reality in real time. The comparison does not include a stereo format, as previous studies proved the relevance of binaural algorithms to the sense of presence at an unconscious level [7].
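To make the contrast between the two approaches concrete, the following minimal sketch illustrates both: a closed-form, model-based cue (here Woodworth's spherical-head ITD formula, one of many possible mathematical cues and not necessarily the model evaluated in this study) that can be computed for any azimuth, and a measurement-based renderer that convolves a mono signal with an HRIR pair. The function names are hypothetical, and NumPy/SciPy are assumed.

import numpy as np
from scipy.signal import fftconvolve

SPEED_OF_SOUND = 343.0  # m/s
HEAD_RADIUS = 0.0875    # m, spherical-head approximation

def itd_spherical(azimuth_rad):
    """Model-based cue: Woodworth's spherical-head ITD (seconds).

    Valid for azimuths in [0, pi/2]. Being a closed-form expression,
    it can be evaluated, and hence interpolated, continuously."""
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (azimuth_rad + np.sin(azimuth_rad))

def render_measured(mono, hrir_left, hrir_right):
    """Measurement-based rendering: convolve a mono signal with an
    HRIR pair taken from a database such as MIT KEMAR.

    mono, hrir_left, hrir_right -- 1-D float arrays; the MIT 'compact'
    HRIRs are 128 samples long. Returns an (N, 2) binaural array."""
    left = fftconvolve(mono, hrir_left)
    right = fftconvolve(mono, hrir_right)
    return np.stack([left, right], axis=1)

The trade-off described above is visible here: the model-based path costs a handful of arithmetic operations per direction, while the measurement-based path costs one filtering operation per ear and is tied to the discrete set of angles at which HRIRs were measured.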
• Stefania Serafin is a professor at Aalborg University Copenhagen. E-mail: sts@create.aau.dk.
• Alejandro Saurí Suárez et al. are graduate students in Sound and Music Computing at Aalborg University Copenhagen.

2 METHODS

2.1 Measured Head-Related Transfer Function: The MIT KEMAR Database

HRTFs, or HRIRs, contain the static cues of spatial hearing. They describe "the transmission from a point in the free field to a point in the human subject's or dummy head's ear canal" [5]. Several HRTF (HRIR) databases are accessible on the Internet, for example the CIPIC library [6] and the MIT KEMAR database [3]. The latter is used here to implement a system capable of real-time interactive binaural rendering using HRTFs from the database.

The problem with real-time use of these databases is the computational cost: excessive CPU usage and time delays may occur. However, real-time applications do not need all of the measurements and samples, so the set can be carefully reduced to a smaller size [6]. The current project uses the MIT 'compact' database, composed of 368 measurements of 128-sample HRIRs. The database contains the left- and right-ear HRIR measurements of sound emitted from 0° to 180° azimuth on the right side of the dummy head, at elevations from -40° to 90° (directly above the head). The density of measurements is higher between elevations -30° and +30° because of the high localization sensitivity of the human ear in this interval. The measurements were made with maximum-length sequences of 16383 points, giving a good signal-to-noise ratio without excessive storage requirements and computation time.

Fig. 1. Number of measurements and azimuth increment at each elevation for the 'compact' database [3].
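As an illustration of how such a reduced set can be used at run time, the sketch below indexes the compact database and snaps a requested direction to the nearest measured grid point, exploiting the set's left-right symmetry. The directory layout and file naming assumed here (folders elev-40 ... elev90 containing 128-sample stereo files such as H10e045a.wav) should be verified against the database documentation, and the helper names are hypothetical.

import os
import numpy as np
from scipy.io import wavfile

def load_compact(root):
    """Index every HRIR pair in the 'compact' set by (elevation, azimuth).

    Assumes folders named elev-40 ... elev90, each holding stereo WAVs
    named e.g. H10e045a.wav (elevation 10, azimuth 45)."""
    table = {}
    for folder in os.listdir(root):
        if not folder.startswith("elev"):
            continue
        elev = int(folder[4:])
        for name in os.listdir(os.path.join(root, folder)):
            if name.startswith("H") and name.endswith("a.wav"):
                azim = int(name.split("e", 1)[1][:-5])  # 'H10e045a.wav' -> 45
                _, data = wavfile.read(os.path.join(root, folder, name))
                table[(elev, azim)] = np.asarray(data)  # shape (128, 2)
    return table

def nearest_hrir(table, elev, azim):
    """Snap a requested direction to the closest measured grid point.

    The compact set covers azimuths 0-180 (right side) only; left-side
    directions are obtained by mirroring the azimuth and swapping ears."""
    azim = azim % 360
    flip = azim > 180
    if flip:
        azim = 360 - azim
    key = min(table, key=lambda k: (k[0] - elev) ** 2 + (k[1] - azim) ** 2)
    hrir = table[key]
    return hrir[:, ::-1] if flip else hrir

In a real-time renderer, this lookup (ideally refined by interpolating between neighbouring grid points) would run as the source or listener moves, with the selected pair fed to a convolution stage such as the one sketched in Section 1.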