JOURNAL OF LATEX CLASS FILES

Automatic Minimisation of Masking in Multitrack Audio using Subgroups

David Ronan, Zheng Ma, Paul Mc Namara, Hatice Gunes, and Joshua D. Reiss

Abstract—The iterative process of masking minimisation when mixing multitrack audio is a challenging optimisation problem, in part due to the complexity and non-linearity of auditory perception. In this article, we first propose a multitrack masking metric inspired by the MPEG psychoacoustic model. We investigate different audio processing techniques to manipulate the frequency and dynamic characteristics of the signal in order to reduce masking based on the proposed metric. We also investigate whether automatic mixing using subgrouping benefits the perceived quality and clarity of a mix. Evaluation results suggest that our proposed masking metric, when utilised in an automatic mixing framework, reduces inter-channel auditory masking and improves the perceived quality and perceived clarity of a mix. Furthermore, our results suggest that using subgrouping in an automatic mixing framework can also improve the perceived quality and perceived clarity of a mix.

Index Terms—Auditory Masking; Multitrack Mixing; MPEG; Equalization; Dynamic Range Processing; Subgrouping; Numerical Optimisation; Perceived Emotion

1 INTRODUCTION

MASKING is a perceptual property of the human auditory system that occurs whenever the presence of a strong audio signal makes the temporal or spectral neighbourhood of weaker audio signals imperceptible [1], [2]. Frequency masking may occur when two or more stimuli are simultaneously presented to the auditory system. The relative shapes of the masker's and maskee's magnitude spectra determine to what extent the presence of certain spectral energy will mask the presence of other spectral energy.
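As a rough illustration of how the relative spectral positions and levels of a masker and a maskee can decide audibility, the following Python sketch models a single tonal masker with a triangular spreading function on the Bark scale. It is a toy model only: it is neither the MPEG psychoacoustic model nor the metric proposed in this article. The slope and offset constants, and the function names `masked_threshold_db` and `is_masked`, are assumptions chosen for illustration, and the absolute threshold of hearing is ignored.

```python
import math

# Illustrative constants (assumptions, not values taken from this article).
LOWER_SLOPE_DB_PER_BARK = 27.0   # assumed steep roll-off below the masker
UPPER_SLOPE_DB_PER_BARK = 12.0   # assumed shallower roll-off above the masker
MASKING_OFFSET_DB = 10.0         # assumed gap between masker level and threshold peak


def hz_to_bark(freq_hz: float) -> float:
    """Zwicker and Terhardt approximation of the Bark (critical-band) scale."""
    return 13.0 * math.atan(0.00076 * freq_hz) + 3.5 * math.atan((freq_hz / 7500.0) ** 2)


def masked_threshold_db(masker_hz: float, masker_level_db: float, probe_hz: float) -> float:
    """Threshold (dB) raised by a single tonal masker at the probe frequency.

    Uses a triangular spreading function on the Bark axis; the absolute
    threshold of hearing is deliberately left out of this toy model.
    """
    distance_bark = hz_to_bark(probe_hz) - hz_to_bark(masker_hz)
    if distance_bark >= 0.0:
        slope = UPPER_SLOPE_DB_PER_BARK
    else:
        slope = LOWER_SLOPE_DB_PER_BARK
    return masker_level_db - MASKING_OFFSET_DB - slope * abs(distance_bark)


def is_masked(masker_hz: float, masker_level_db: float,
              probe_hz: float, probe_level_db: float) -> bool:
    """True if the probe tone falls below the masker's raised threshold."""
    return probe_level_db < masked_threshold_db(masker_hz, masker_level_db, probe_hz)


if __name__ == "__main__":
    # A 150 Hz masker at 70 dB against a nearby and a distant 40 dB tone.
    print(is_masked(150.0, 70.0, 170.0, 40.0))
    print(is_masked(150.0, 70.0, 2000.0, 40.0))
```

Under these assumed constants, a quiet tone close in frequency to the masker falls below the raised threshold, while the same tone several critical bands away does not. A realistic model would also make the upper slope depend on masker level and sum the contributions of many maskers.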
Temporal masking is the characteristic of the auditory system where sounds are hidden due to a masking signal occurring before (pre-masking) or after (post-masking) a masked signal. The effectiveness of temporal masking attenuates exponentially from the onset and offset of the masker [3]. In simplified terms, masking occurs when a strong noise or tone masker creates an excitation of sufficient strength on the basilar membrane. An excitation pattern is a neural representation of the pattern of resonance on the basilar membrane, caused by a given sound [4]. The area around the characteristic frequency (referred to as the frequency bandwidth of the "overlapping bandpass filter" created by the cochlea) of the masker's signal location effectively blocks the detection of weaker signals [3]. Examples of frequency and temporal masking are shown in Figure 1 and Figure 2 respectively.

Fig. 1. Frequency masking example of a 150 Hz tone signal masking an adjacent frequency tone by increasing the threshold of audibility around 150 Hz.

Fig. 2. Schematic drawing to illustrate and characterise the regions within which pre-masking, simultaneous masking and post-masking occur. Note that post-masking uses a different time origin than pre-masking and simultaneous masking [3].

• D. Ronan is with the Centre for Intelligent Sensing, Queen Mary University of London, UK. E-mail: d.m.ronan@qmul.ac.uk
• H. Gunes is with the Computer Laboratory, University of Cambridge, UK. E-mail: hatice.gunes@cl.cam.ac.uk
• J.D. Reiss is with the Centre for Digital Music, Queen Mary University of London, UK. E-mail: joshua.reiss@qmul.ac.uk

Mixing is a process in which multitrack material, whether recorded, sampled or synthesised, is balanced, treated and combined into an output format that is most commonly two-channel stereo [5]. In the process of mixing, sound sources inevitably mask one another, which reduces the ability to fully hear and distinguish each sound source.
Partial masking occurs whenever the audibility of a sound is degraded due to the presence of other content, but the sound may still be perceived.

arXiv:1803.09960v2 [eess.AS] 28 Mar 2018