Feature-Based Synthesis: A Tool for Evaluating, Designing, and Interacting with Music IR Systems

Matt Hoffman
Princeton University Computer Science Department
35 Olden Street, Princeton, NJ 08544
mdhoffma@cs.princeton.edu

Perry R. Cook
Princeton University Computer Science & Music Departments
35 Olden Street, Princeton, NJ 08544
prc@cs.princeton.edu

Abstract

We present a general framework for performing feature-based synthesis – that is, for producing audio characterized by arbitrarily specified sets of perceptually motivated, quantifiable acoustic features of the sort used in many music information retrieval systems.

1. Introduction

We have implemented a general framework for performing feature-based synthesis, which attempts to synthesize, given any set of feature values, audio that matches those values as closely as possible. Depending on how the values to be synthesized are chosen, feature-based synthesis can be used to evaluate the usefulness of a given feature set for a particular audio IR domain, to diagnose why a system is not performing as well as expected, to gain insight into what information a feature set encodes, and to generate stimuli for studies of human perception.

We frame the problem as minimizing, over the set of underlying synthesis parameters, the distance between a target feature vector and the feature vector describing the synthesized sound. The mapping between feature space and parameter space can be highly nonlinear, which complicates optimization. Our framework separates the tasks of feature extraction, feature comparison, sound synthesis, and parameter optimization, making it possible to combine various techniques in the search for an efficient and accurate solution to the problem of synthesizing sounds that manifest arbitrary perceptual features.

2. Motivation

2.1 Feature Evaluation and Selection

Feature-based synthesis can be used to address the question of which relevant qualities, if they were encoded by one’s feature set, might enable better performance on a problem. As Lidy, Pölzlbauer, and Rauber [1] observe, one way of qualitatively evaluating the meaningfulness of a feature set is through an analysis-by-synthesis process: one extracts the features in question from multiple sounds in the target domain, synthesizes new sounds matching the extracted features, and compares the original and resynthesized versions. If the resynthesized version of a sound file lacks some quality relevant to the problem at hand, then adding a feature representing that quality to the feature set is likely to improve performance.

2.2 Feature Exploration

Our system provides an interface for synthesizing audio that manifests feature values specified in real time, which can be used to gain a more intuitive understanding of how the features one is using map to actual sounds. Attempting to generate sounds with specific perceptual characteristics in this way can stimulate insights into how much descriptive power a feature set has.

2.3 Perceptual Study Stimulus Generation

Studies such as [2][3][4] have investigated the human ability to perceive various physical attributes of sound sources. We suggest that feature-based synthesis could be of use in studying the low-level acoustical properties that human listeners use to deduce the more complex physical attributes of a sound’s source. We can generate sounds defined over a set of features we expect to correlate with listeners’ perceptions of, e.g., size, material, or shape, and then use techniques like those described in [5] to determine how those sounds map to the ecological features we wish to study.
From the data points obtained in this way, we may be able to discover consistent relationships between acoustical and human-generated features that can be used to predict how a sound manifesting certain acoustic feature values will be perceived.

2.4 Classification System Evaluation

We can also treat the confidence outputs of entire classification systems as features to match, enabling us to gain insight into what sorts of audio a system strongly believes fit into one category or another, as well as what sorts of audio it finds difficult to classify.

© 2006 University of Victoria
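The matching problem at the heart of this framework can be illustrated with a deliberately tiny sketch. This is not the authors' implementation: the two-parameter sine synthesizer, the toy feature vector (RMS energy and zero-crossing rate), and the stochastic hill-climbing optimizer are all stand-ins chosen for brevity, illustrating only the separation of synthesis, feature extraction, feature comparison, and parameter optimization described in the introduction.

```python
import math
import random

def synthesize(params, n=2048, sr=8000):
    """Toy synthesizer: a sine tone with free amplitude and frequency."""
    amp, freq = params
    return [amp * math.sin(2 * math.pi * freq * i / sr) for i in range(n)]

def extract_features(signal):
    """Toy feature vector: RMS energy and zero-crossing rate."""
    n = len(signal)
    rms = math.sqrt(sum(x * x for x in signal) / n)
    zcr = sum(1 for a, b in zip(signal, signal[1:]) if a * b < 0) / (n - 1)
    return [rms, zcr]

def feature_distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_features(target, init=(0.5, 1000.0), iters=2000, seed=0):
    """Stochastic hill climbing over synthesis parameters: perturb the
    current parameters and keep the move whenever it brings the
    synthesized sound's features closer to the target vector."""
    rng = random.Random(seed)
    best = list(init)
    best_d = feature_distance(extract_features(synthesize(best)), target)
    for _ in range(iters):
        cand = [best[0] + rng.gauss(0, 0.05), best[1] + rng.gauss(0, 50.0)]
        cand[0] = min(max(cand[0], 0.0), 1.0)      # amplitude in [0, 1]
        cand[1] = min(max(cand[1], 20.0), 3999.0)  # frequency below Nyquist
        d = feature_distance(extract_features(synthesize(cand)), target)
        if d < best_d:
            best, best_d = cand, d
    return best, best_d

# Resynthesis check in the spirit of Section 2.1: extract features from a
# "real" sound, then search for parameters that reproduce them.
target = extract_features(synthesize([0.8, 440.0]))
params, dist = match_features(target)
```

Because the feature-to-parameter mapping is generally nonlinear and may lack gradients, a derivative-free search of this kind is one natural choice for the optimizer component; the framework's modularity means it could be swapped for any other optimization strategy without touching the synthesis or feature-extraction components.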