IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 16, NO. 2, FEBRUARY 2008 467
Semantic Annotation and Retrieval
of Music and Sound Effects
Douglas Turnbull, Student Member, IEEE, Luke Barrington, David Torres, and Gert Lanckriet
Abstract—We present a computer audition system that can both
annotate novel audio tracks with semantically meaningful words
and retrieve relevant tracks from a database of unlabeled audio
content given a text-based query. We consider the related tasks of
content-based audio annotation and retrieval as one supervised
multiclass, multilabel problem in which we model the joint proba-
bility of acoustic features and words. We collect a data set of 1700
human-generated annotations that describe 500 Western popular
music tracks. For each word in a vocabulary, we use this data to
train a Gaussian mixture model (GMM) over an audio feature
space. We estimate the parameters of the model using the weighted
mixture hierarchies expectation maximization algorithm. This
algorithm is more scalable to large data sets and produces better
density estimates than standard parameter estimation techniques.
The quality of the music annotations produced by our system is
comparable with the performance of humans on the same task.
Our “query-by-text” system can retrieve appropriate songs for a
large number of musically relevant words. We also show that our
audition system is general by learning a model that can annotate
and retrieve sound effects.
Index Terms—Audio annotation and retrieval, music informa-
tion retrieval, semantic music analysis.
I. INTRODUCTION
MUSIC is a form of communication that can represent
human emotions, personal style, geographic origins,
spiritual foundations, social conditions, and other aspects of hu-
manity. Listeners naturally use words in an attempt to describe
what they hear even though two listeners may use drastically
different words when describing the same piece of music.
However, words related to some aspects of the audio content,
such as instrumentation and genre, may be largely agreed upon
by a majority of listeners. This agreement suggests that it is
possible to create a computer audition system that can learn the
relationship between audio content and words. In this paper,
we describe such a system and show that it can both annotate
novel audio content with semantically meaningful words and
retrieve relevant audio tracks from a database of unannotated
tracks given a text-based query.
Manuscript received December 16, 2006; revised November 8, 2007. This
work was supported by the National Science Foundation (NSF) under Grants
IGERT DGE-0333451 and DMS-MSPA 062540922. Some of the material pre-
sented in this paper was presented at SIGIR’07 and ISMIR’06. The associate
editor coordinating the review of this manuscript and approving it for publica-
tion was Prof. Mark Sandler.
D. Turnbull and D. Torres are with the Department of Computer Science and
Engineering, University of California at San Diego, La Jolla, CA 92093 USA
(e-mail: dturnbul@cs.ucsd.edu; datorres@cs.ucsd.edu).
L. Barrington and G. Lanckriet are with the Department of Electrical and
Computer Engineering, University of California at San Diego, La Jolla, CA
92093 USA (e-mail: lbarrington@ucsd.edu; gert@ece.ucsd.edu).
Digital Object Identifier 10.1109/TASL.2007.913750
TABLE I
AUTOMATIC ANNOTATIONS GENERATED USING THE AUDIO CONTENT.
WORDS IN BOLD ARE OUTPUT BY OUR SYSTEM AND THEN PLACED
INTO A MANUALLY CONSTRUCTED NATURAL LANGUAGE TEMPLATE
We view the related tasks of semantic annotation and re-
trieval of audio as one supervised multiclass, multilabel learning
problem. We learn a joint probabilistic model of audio content
and words using an annotated corpus of audio tracks. Each track
is represented as a set of feature vectors that is extracted by
passing a short-time window over the audio signal. The text description of a track is represented by an annotation vector, a
vector of weights where each element indicates how strongly a
semantic concept (i.e., a word) applies to the audio track.
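The annotation-vector representation described above can be sketched as follows. This is an illustrative example only: the vocabulary, the number of annotators, and the convention of weighting each word by the fraction of annotators who applied it are all assumptions made here for concreteness, not details taken from the paper.

```python
import numpy as np

# Hypothetical vocabulary and human labels, invented for illustration.
vocab = ["rock", "jazz", "guitar", "piano", "mellow"]

# Suppose four annotators each labeled the same track with a set of words.
human_labels = [
    {"rock", "guitar"},
    {"rock", "guitar", "mellow"},
    {"rock"},
    {"guitar", "mellow"},
]

num_annotators = len(human_labels)

# Each element of the annotation vector is the fraction of annotators who
# used that word: a weight in [0, 1] indicating how strongly the semantic
# concept applies to the track.
annotation = np.array(
    [sum(word in labels for labels in human_labels) / num_annotators
     for word in vocab]
)

print(dict(zip(vocab, annotation)))
```

Under this toy weighting, "rock" and "guitar" receive weight 0.75, "mellow" receives 0.5, and unused words receive 0.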
Our probabilistic model consists of one word-level distribution over
the audio feature space for each word in our vocabulary. Each
distribution is modeled using a multivariate Gaussian mixture
model (GMM). The parameters of a word-level GMM are es-
timated using audio content from a set of training tracks that
are positively associated with the word. Using this model, we
can infer likely semantic annotations given a novel track and
can use a text-based query to rank-order a set of unannotated
tracks. For illustrative purposes, Table I displays annotations of
songs produced by our system. Placing the most likely words
from specific semantic categories into a natural language con-
text demonstrates how our annotation system can be used to gen-
erate automatic music reviews. Table II shows some of the top
songs that the system retrieves from our data set, given various
text-based queries.
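The word-level GMM approach can be sketched with a minimal example. Note the assumptions: the features here are synthetic Gaussian blobs standing in for real audio features, and the GMMs are fit with scikit-learn's plain EM rather than the weighted mixture hierarchies EM algorithm the paper actually uses. The vocabulary, track names, and data are all invented for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def fake_track(center, frames=200, dims=4):
    # Stand-in for a track's feature-vector set: one row per short-time
    # window (e.g., an MFCC-like vector), drawn from a toy distribution.
    return rng.normal(loc=center, scale=1.0, size=(frames, dims))

# Training tracks positively associated with each word (toy data).
train = {
    "mellow": [fake_track(-2.0), fake_track(-2.2)],
    "aggressive": [fake_track(+2.0), fake_track(+2.2)],
}

# One GMM per vocabulary word, fit on the pooled frames of the tracks
# positively associated with that word.
models = {
    word: GaussianMixture(n_components=2, random_state=0).fit(np.vstack(tracks))
    for word, tracks in train.items()
}

# Annotation: score a novel track under every word's GMM (average
# per-frame log-likelihood) and rank the words.
novel = fake_track(-2.1)
scores = {word: gmm.score(novel) for word, gmm in models.items()}
best_word = max(scores, key=scores.get)

# Retrieval: given a text query, rank unannotated tracks by their
# likelihood under the query word's GMM.
library = {"track_a": fake_track(-2.0), "track_b": fake_track(+2.0)}
ranking = sorted(library, key=lambda t: models["mellow"].score(library[t]),
                 reverse=True)
```

With the well-separated toy data, the novel track near the "mellow" cluster is annotated as "mellow", and "track_a" ranks first for the query "mellow". The real system differs chiefly in its features and in how the per-word GMM parameters are estimated.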
Our model is based on the supervised multiclass labeling
(SML) model that has been recently proposed for the task of