Feature-Based Synthesis: A Tool for Evaluating, Designing, and Interacting
with Music IR Systems
Matt Hoffman
Princeton University
Computer Science Department
35 Olden Street
Princeton, NJ 08544
mdhoffma@cs.princeton.edu
Perry R. Cook
Princeton University
Computer Science & Music Departments
35 Olden Street
Princeton, NJ 08544
prc@cs.princeton.edu
Abstract
We present a general framework for performing feature-
based synthesis – that is, for producing audio characterized
by arbitrarily specified sets of perceptually motivated,
quantifiable acoustic features of the sort used in many
music information retrieval systems.
1. Introduction
We have implemented a general framework for performing
feature-based synthesis, which attempts to synthesize,
given any set of feature values, audio that matches those
feature values as closely as possible. Depending on how
one chooses the values to synthesize, feature-based
synthesis can be used to evaluate the usefulness of a given
set of features for a particular audio IR domain, to
diagnose why a system is not performing as well as
expected, to gain insight into what information a set of
features encodes, and to generate stimuli for use in studies
of human perception.
We frame the problem in terms of minimizing the
distance between a target feature vector and the feature
vector describing the synthesized sound over the set of
underlying synthesis parameters. The mapping between
feature space and parameter space can be highly nonlinear,
complicating optimization. Our framework separates the
tasks of feature extraction, feature comparison, sound
synthesis, and parameter optimization, making it possible
to combine various techniques in the search for an efficient
and accurate solution to the problem of synthesizing
sounds manifesting arbitrary perceptual features.
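The minimization described above can be sketched in a few lines. Everything in this sketch is an illustrative assumption rather than our actual implementation: the synthesizer is a bare sine oscillator with frequency and amplitude parameters, the feature extractor computes only RMS energy and spectral centroid, and the optimizer is off-the-shelf Nelder-Mead. The point is the separation of roles: synthesis, feature extraction, feature comparison, and parameter search are independent, swappable components.

```python
import numpy as np
from scipy.optimize import minimize

SR = 16000  # assumed sample rate

def synthesize(params, n=4096):
    """Toy synthesizer: a sine wave with free frequency and amplitude."""
    freq, amp = params
    t = np.arange(n) / SR
    return amp * np.sin(2 * np.pi * freq * t)

def extract_features(audio):
    """Two illustrative features: RMS energy and spectral centroid (Hz)."""
    rms = np.sqrt(np.mean(audio ** 2))
    mags = np.abs(np.fft.rfft(audio))
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / SR)
    centroid = np.sum(freqs * mags) / (np.sum(mags) + 1e-12)
    return np.array([rms, centroid])

def feature_distance(params, target):
    """Normalized Euclidean distance between target and synthesized features."""
    feats = extract_features(synthesize(params))
    return np.linalg.norm((feats - target) / (np.abs(target) + 1e-12))

target = np.array([0.3, 440.0])   # desired RMS and centroid
x0 = np.array([1000.0, 0.9])      # initial frequency and amplitude guess
result = minimize(feature_distance, x0, args=(target,), method="Nelder-Mead")
```

Because the feature-to-parameter mapping is generally nonlinear and non-convex, a derivative-free local search like the one above can stall in local minima; this is precisely why decoupling the optimizer from the other components, so that better search strategies can be dropped in, matters in practice.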
2. Motivation
2.1 Feature Evaluation and Selection
Feature-based synthesis can be used to address the
question of what relevant qualities, if they were encoded
by one’s feature set, might enable better performance on a
problem. As Lidy, Pölzlbauer, and Rauber [1] observe, one
way of qualitatively evaluating the meaningfulness of a
feature set is through an analysis-by-synthesis process
where one extracts the features in question from multiple
sounds from the target domain, synthesizes new sounds
matching the extracted features, and compares the original
and resynthesized versions. If the resynthesized version of
a sound file lacks some quality relevant to the problem at
hand, then it is likely that adding a feature representing
that quality to the feature set will improve performance.
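The analysis-by-synthesis diagnosis above can be made concrete with a deliberately impoverished feature set. In this hypothetical sketch (not drawn from [1] or from our system), the feature set encodes only RMS energy, so the resynthesis matches loudness exactly while losing the original's spectral character; the large residual in spectral centroid signals exactly the kind of missing quality whose feature should be added.

```python
import numpy as np

SR = 16000
rng = np.random.default_rng(0)

def rms(audio):
    return float(np.sqrt(np.mean(audio ** 2)))

def spectral_centroid(audio):
    mags = np.abs(np.fft.rfft(audio))
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / SR)
    return float(np.sum(freqs * mags) / (np.sum(mags) + 1e-12))

def resynthesize_from_rms(target_rms, n=4096):
    """Resynthesis from a one-feature set (RMS only): scaled white noise."""
    noise = rng.standard_normal(n)
    return noise * (target_rms / rms(noise))

# Original sound: a 440 Hz tone.
t = np.arange(4096) / SR
original = 0.5 * np.sin(2 * np.pi * 440.0 * t)
recon = resynthesize_from_rms(rms(original))

# The encoded feature (RMS) is reproduced; the unencoded one (centroid) is not.
rms_error = abs(rms(recon) - rms(original))
centroid_error = abs(spectral_centroid(recon) - spectral_centroid(original))
```

Comparing original and resynthesis per feature, rather than only by ear, turns the qualitative listening test into a quantitative report of which perceptual dimensions the feature set fails to capture.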
2.2 Feature Exploration
Our system provides an interface for synthesizing audio
manifesting feature values specified in real time, which can
be used to gain a more intuitive understanding of how the
various features one is using map to actual sounds.
Attempting to generate sounds with specific perceptual
characteristics in this way can stimulate insights into how
much descriptive power a feature set has.
2.3 Perceptual Study Stimulus Generation
Studies such as [2], [3], and [4] have investigated the human
ability to perceive various physical attributes of sound
sources. We suggest that feature-based synthesis could be
of use in studying the low-level acoustical properties that
human listeners use to deduce the more complex physical
attributes of a sound’s source. We can generate sounds
defined over a set of features we expect to correlate with
listeners’ perceptions of, e.g., size, material, or shape, and
then use techniques like those described in [5] to determine
how those sounds map to the ecological features we wish
to study. From the data points obtained in this way, we
may be able to discover consistent relationships between
acoustical and human-generated features that can be used to
predict how a sound manifesting certain acoustic feature
values will be perceived.
2.4 Classification System Evaluation
We can also treat the confidence outputs of entire
classification systems as features to match, enabling us to
gain insights into what sorts of audio a system strongly
believes fit into one category or another, as well as what
sorts of audio it finds difficult to classify.
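A minimal sketch of this idea, under stated assumptions: the "classifier" here is a stand-in logistic model over spectral centroid (a hypothetical decision boundary, not a trained system), and the synthesizer is a one-parameter sine oscillator searched by brute force. Treating the model's confidence as the feature to match, we search for audio the classifier strongly believes belongs to a "bright" category.

```python
import numpy as np

SR = 16000

def synthesize(freq, n=4096):
    """Toy one-parameter synthesizer: a unit-amplitude sine wave."""
    t = np.arange(n) / SR
    return np.sin(2 * np.pi * freq * t)

def spectral_centroid(audio):
    mags = np.abs(np.fft.rfft(audio))
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / SR)
    return np.sum(freqs * mags) / (np.sum(mags) + 1e-12)

def bright_confidence(audio):
    """Stand-in classifier: logistic confidence that a sound is 'bright',
    with an assumed decision boundary near a 2 kHz centroid."""
    return 1.0 / (1.0 + np.exp(-(spectral_centroid(audio) - 2000.0) / 200.0))

# Treat the confidence output as the feature to maximize: search synthesis
# parameters for audio the classifier is most certain about.
candidates = np.linspace(100.0, 6000.0, 60)
best = max(candidates, key=lambda f: bright_confidence(synthesize(f)))
```

The same search run near confidence 0.5 instead of 1.0 would surface sounds the system finds ambiguous, which is often more revealing than its confident exemplars.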
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies
are not made or distributed for profit or commercial advantage and
that copies bear this notice and the full citation on the first page.
© 2006 University of Victoria