Proceedings of Timbre 2018: Timbre Is a Many-Splendored Thing, 5-7 July 2018, Montreal, Quebec, Canada

Towards translating perceptual distance from hearing to vision using neural networks

Etienne Richan 1, Jean Rouat 1

1 NECOTIS, départ. génie électrique et génie informatique, Université de Sherbrooke, Sherbrooke, QC, Canada
etienne.richan@usherbrooke.ca

Aims/goals

The goal of this research project is to develop a system that maps sounds to images in a manner that respects perceptual distances in both modalities. In other words, the degree of perceived difference between two sounds should be reflected in the dissimilarity of the images generated from those sounds. This is not a trivial problem, and there appears to be little prior work on translating perceptual distances between modalities. In our approach, timbral features are used to measure perceived auditory distance, and a neural network for style transfer is used to synthesize textural images. Our software allows users to select sound-image pairings that are meaningful to them, which can then be extrapolated to other sounds. The generated images aim to help users distinguish between different musical sound samples without needing to hear them.

Background information

Features such as spectral centroid and log attack time have been shown to be highly correlated with the perceptual classification of musical timbre [8]. It is therefore of interest to study which visual metaphors are most effective for representing these auditory dimensions. Research on audio-visual correlations has generally focused on relatively simple auditory and visual features such as loudness, pitch, position, shape and colour [10]. Recent studies have explored the more complex correlations between timbral features and 2D and 3D shapes [1, 9]. Other studies have investigated correspondences between timbral and textural properties such as coarseness, granularity and regularity [6].
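For concreteness, the two timbral features named above can be approximated directly from a sampled waveform. The sketch below is a simplified illustration of their standard definitions, not the implementation of any particular toolkit; the envelope threshold values (20% and 90% of the maximum) are common illustrative choices rather than a fixed standard.

```python
import numpy as np

def spectral_centroid(signal, sample_rate):
    """Magnitude-weighted mean frequency of the spectrum (Hz)."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return np.sum(freqs * spectrum) / np.sum(spectrum)

def log_attack_time(envelope, sample_rate, start=0.2, stop=0.9):
    """log10 of the time (s) for the amplitude envelope to rise
    from `start` to `stop` of its maximum value."""
    env = envelope / np.max(envelope)
    i_start = np.argmax(env >= start)   # first index above threshold
    i_stop = np.argmax(env >= stop)
    return np.log10(max(i_stop - i_start, 1) / sample_rate)

# A "bright" tone (more high-frequency energy) has a higher
# spectral centroid than a "dark" one at the same pitch.
sr = 16000
t = np.arange(sr) / sr
dark = np.sin(2 * np.pi * 220 * t)
bright = dark + 0.8 * np.sin(2 * np.pi * 3520 * t)
assert spectral_centroid(bright, sr) > spectral_centroid(dark, sr)
```

Features like these form a low-dimensional space in which distances between notes can be compared to perceptual dissimilarity ratings.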
While these studies have found strong correlations between auditory features and visual attributes, they are limited in that they map a single auditory parameter to a single visual parameter. Neural networks provide a general framework for hierarchically extracting features from complex distributions. The hierarchical representation of visual features in neural image classification networks is comparable, at least in the initial layers, to that of the visual cortex [3]. This suggests that features extracted by these networks can serve as a measure of visual perceptual distance.

Style transfer networks are a recent offshoot of research on neural networks for image classification. The covariance matrices between feature maps at different layers of these deep convolutional networks are representative of the stylistic (or textural) structure of images [5]. Initial approaches transferred artistic style from one image to another by iteratively optimizing an input image through backpropagation [5]. More recent approaches [7] have greatly sped up the process by directly altering the feature maps of an input image to match those of the desired style using a signal whitening and colouring transform. Stylized images are then reconstructed from this latent representation by a pretrained decoder network. This algorithm provides a novel way of controlling image space: instead of directly manipulating coordinates, colour schemes or geometry, we can use the network to synthesize and interpolate between different textures.

Methodology

The research project is divided into three phases. The first is to find a space for audio samples, based on timbral features, in which proximity reflects perceptual similarity. The Essentia library [2] is used to extract a set of low-level timbral features from monophonic musical samples. The NSynth dataset [4] provides training and testing samples.
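The whitening and colouring step described above can be sketched for a single layer of feature maps. The following is a minimal numpy illustration of the idea behind [7], assuming features are given as (channels, pixels) arrays; it omits the multi-level scheme, the encoder, and the pretrained decoder of the full method.

```python
import numpy as np

def whiten_and_colour(content_feats, style_feats, eps=1e-5):
    """Single-layer whitening/colouring transform (simplified).

    The content features are decorrelated across channels
    (whitening), then given the channel covariance of the style
    features (colouring), so their second-order statistics match
    the style's.
    """
    fc = content_feats - content_feats.mean(axis=1, keepdims=True)
    mu_s = style_feats.mean(axis=1, keepdims=True)
    fs = style_feats - mu_s

    # Whitening: remove the content's channel correlations.
    cov_c = fc @ fc.T / (fc.shape[1] - 1)
    ec, vc = np.linalg.eigh(cov_c)
    ec = np.maximum(ec, eps)  # guard against tiny eigenvalues
    whitened = vc @ np.diag(ec ** -0.5) @ vc.T @ fc

    # Colouring: impose the style's channel covariance and mean.
    cov_s = fs @ fs.T / (fs.shape[1] - 1)
    es, vs = np.linalg.eigh(cov_s)
    es = np.maximum(es, eps)
    coloured = vs @ np.diag(es ** 0.5) @ vs.T @ whitened
    return coloured + mu_s
```

After this transform, the output features have (up to the eigenvalue floor) the same channel covariance as the style features, which is the property the decoder network exploits to reconstruct a stylized image.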
It contains ~300,000 four-second notes from ~1,000 acoustic, electric and synthetic instruments, played across their respective ranges. The dataset also provides semantic quality annotations for each note (e.g. bright, dark, distorted). We train a shallow neural network to learn a mapping from timbral features to a space where notes with common qualities