Hand Motion-Controlled Audio Mixing Interface

Jarrod Ratcliffe
Department of Music and Performing Arts Professions, Music Technology
New York University
jpr350@nyu.edu

ABSTRACT
This paper presents a control interface for music mixing using real-time computer vision. Two input sensors are considered: the Leap Motion and the Microsoft Kinect. The author presents the predominant design considerations: improving the user's sense of depth and panorama, maintaining broad accessibility through integration with Digital Audio Workstation (DAW) software, and keeping the system portable and affordable. To give the user a heightened sense of sound spatialization relative to the traditional channel strip, depth is addressed directly using the stage metaphor. Sound sources are represented as colored spheres in a graphical user interface to provide the user with visual feedback. Moving a source backward and forward controls its volume, while moving it left and right controls its panning. For broader accessibility, the interface is configured to control mixing within the Ableton Live DAW. The author also discusses future plans to expand functionality and evaluate the system.

Keywords
music mixing, music production, computer vision

1. INTRODUCTION
The impact of technology on music over the past 60 years is difficult to overstate. It has changed the way music is performed, recorded, consumed, and, in many cases, composed. It has spawned the invention of new musical instruments. Without these developments, the NIME conference and community might not exist today.

It is remarkable, then, that new interfaces have had such a limited impact on music production and recording studio technology. Since the first analog mixing console was released in the late 1950s, very little has changed in its design. For each incoming channel, a channel strip provides several knobs, switches, and a fader, each controlling a specific parameter in a one-to-one mapping. This interface has carried over to digital mixing consoles and, metaphorically, to software mixers in digital audio workstations.

It is a common goal in audio mixing to create a specific sonic image for the listener, using psychoacoustics to provide localization cues. In a stereo mix, the most basic parameters controlling this sonic image are lateral position and depth. Under the channel strip metaphor, these parameters are approximated by a pan potentiometer and a fader (controlling level), respectively. These are only approximate correlations: in a physical space, other psychoacoustic parameters also correlate with sound localization in humans, including spectral content and time delays. These parameters are often emulated using artificial reverberation.

While the channel strip metaphor offers precise control over many sonic parameters in a mix, its mappings for width and depth pose one main challenge: the position of a sound source within the image is not immediately apparent from looking at the mixing console. Pan potentiometers are a reasonable representation of apparent lateral position; however, the channel physically located on the left-most side of the interface may in fact be panned hard right, creating a dissonance for the user.
The user must look at the position of the pan knob on every channel to get a sense of each source's lateral position, and there is typically no direct way to visualize the stereo image as a whole. In addition, with the channel strip metaphor, faders control level, which in turn affects the perceived depth of a sound source. The difficulty with this mapping is that fader position is inversely proportional to apparent depth: the fader positioned closest to the user produces the sound perceived as farthest away. This inverse relationship is easily learned; however, when compounded with the dissonance between a channel strip's physical position and the position of its corresponding sound source, it can make it difficult to localize multiple sound sources simultaneously, posing challenges to the user's perception of the relationships between sources.

Figure 1. Channel Strip vs. Stage Metaphor

The current work addresses the control of apparent lateral position and depth using the stage metaphor (Figure 1). The stage metaphor defines a listening point and allows the user to control panning and level simultaneously through the position of the sound source relative to that point along two corresponding dimensions. While most implementations of the stage metaphor use either keyboard-and-mouse or multi-touch input (see Section 2), the current work explores an implementation using computer vision with position tracking.
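To make the two-dimensional mapping concrete, the sketch below shows one plausible realization of the stage metaphor in Python. It is an illustration only, not the system's actual implementation: the function name stage_to_mix, the normalized stage coordinates, the reference distance, and the inverse-distance gain law are all assumptions introduced here for the example. Pan is returned in the -1..1 range that a DAW such as Ableton Live exposes.

```python
import math

def stage_to_mix(x, y, listener_x=0.5):
    """Map a normalized stage position to (level, pan).

    x: 0.0 (stage left) .. 1.0 (stage right)
    y: 0.0 (listening point) .. 1.0 (back of stage)
    Returns a level in 0..1 and a pan position in -1..1.
    Hypothetical sketch; names and constants are assumptions.
    """
    # Depth -> level: sources farther from the listening point are
    # quieter. An inverse-distance law gives roughly -6 dB per
    # doubling of distance beyond the reference distance.
    ref = 0.1  # assumed reference distance for unity gain
    dist = max(math.hypot(x - listener_x, y), ref)
    level = ref / dist

    # Lateral offset -> pan, clipped to the DAW's valid range.
    pan = max(-1.0, min(1.0, 2.0 * (x - listener_x)))
    return level, pan

# Example: a source at center stage, halfway back
level, pan = stage_to_mix(0.5, 0.5)  # level = 0.2, pan = 0.0
```

Under a mapping of this kind, pushing a source's sphere straight back halves its level each time its distance from the listening point doubles, while sliding it laterally changes only the pan parameter, so the on-screen position and the perceived position of the source remain in agreement.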