IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 16, NO. 2, FEBRUARY 2008

Semantic Annotation and Retrieval of Music and Sound Effects

Douglas Turnbull, Student Member, IEEE, Luke Barrington, David Torres, and Gert Lanckriet

Abstract—We present a computer audition system that can both annotate novel audio tracks with semantically meaningful words and retrieve relevant tracks from a database of unlabeled audio content given a text-based query. We consider the related tasks of content-based audio annotation and retrieval as one supervised multiclass, multilabel problem in which we model the joint probability of acoustic features and words. We collect a data set of 1700 human-generated annotations that describe 500 Western popular music tracks. For each word in a vocabulary, we use this data to train a Gaussian mixture model (GMM) over an audio feature space. We estimate the parameters of the model using the weighted mixture hierarchies expectation-maximization algorithm. This algorithm is more scalable to large data sets and produces better density estimates than standard parameter estimation techniques. The quality of the music annotations produced by our system is comparable with the performance of humans on the same task. Our “query-by-text” system can retrieve appropriate songs for a large number of musically relevant words. We also show that our audition system is general by learning a model that can annotate and retrieve sound effects.

Index Terms—Audio annotation and retrieval, music information retrieval, semantic music analysis.

I. INTRODUCTION

Music is a form of communication that can represent human emotions, personal style, geographic origins, spiritual foundations, social conditions, and other aspects of humanity. Listeners naturally use words in an attempt to describe what they hear, even though two listeners may use drastically different words when describing the same piece of music.
However, words related to some aspects of the audio content, such as instrumentation and genre, may be largely agreed upon by a majority of listeners. This agreement suggests that it is possible to create a computer audition system that can learn the relationship between audio content and words. In this paper, we describe such a system and show that it can both annotate novel audio content with semantically meaningful words and retrieve relevant audio tracks from a database of unannotated tracks given a text-based query.

Manuscript received December 16, 2006; revised November 8, 2007. This work was supported by the National Science Foundation (NSF) under Grants IGERT DGE-0333451 and DMS-MSPA 062540922. Some of the material presented in this paper was presented at SIGIR’07 and ISMIR’06. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Mark Sandler.

D. Turnbull and D. Torres are with the Department of Computer Science and Engineering, University of California at San Diego, La Jolla, CA 92093 USA (e-mail: dturnbul@cs.ucsd.edu; datorres@cs.ucsd.edu). L. Barrington and G. Lanckriet are with the Department of Electrical and Computer Engineering, University of California at San Diego, La Jolla, CA 92093 USA (e-mail: lbarrington@ucsd.edu; gert@ece.ucsd.edu).

Digital Object Identifier 10.1109/TASL.2007.913750

TABLE I
AUTOMATIC ANNOTATIONS GENERATED USING THE AUDIO CONTENT. WORDS IN BOLD ARE OUTPUT BY OUR SYSTEM AND THEN PLACED INTO A MANUALLY CONSTRUCTED NATURAL LANGUAGE TEMPLATE

We view the related tasks of semantic annotation and retrieval of audio as one supervised multiclass, multilabel learning problem. We learn a joint probabilistic model of audio content and words using an annotated corpus of audio tracks. Each track is represented as a set of feature vectors that are extracted by passing a short-time window over the audio signal.
The text description of a track is represented by an annotation vector, a vector of weights where each element indicates how strongly a semantic concept (i.e., a word) applies to the audio track.

Our probabilistic model is one word-level distribution over the audio feature space for each word in our vocabulary. Each distribution is modeled using a multivariate Gaussian mixture model (GMM). The parameters of a word-level GMM are estimated using audio content from a set of training tracks that are positively associated with the word. Using this model, we can infer likely semantic annotations given a novel track and can use a text-based query to rank-order a set of unannotated tracks.

For illustrative purposes, Table I displays annotations of songs produced by our system. Placing the most likely words from specific semantic categories into a natural language context demonstrates how our annotation system can be used to generate automatic music reviews. Table II shows some of the top songs that the system retrieves from our data set, given various text-based queries.

Our model is based on the supervised multiclass labeling (SML) model that has been recently proposed for the task of

1558-7916/$25.00 © 2008 IEEE
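As a rough illustration of the pipeline described above, the sketch below trains one GMM per vocabulary word on short-time frame features and uses the word-level likelihoods for annotation (rank words for a track) and retrieval (rank tracks for a query word). This is a minimal sketch, not the paper's implementation: raw signal frames stand in for real acoustic features, scikit-learn's standard EM replaces the weighted mixture hierarchies EM algorithm estimated in the paper, and all function names and parameters are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def frame_features(signal, frame_len=64, hop=32):
    # Pass a short-time window over the signal; each frame becomes one
    # feature vector. (A real system would compute acoustic features
    # such as MFCCs per frame instead of using raw samples.)
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array(frames)

def train_word_models(tracks, annotations, vocab, n_components=2):
    # One word-level GMM per vocabulary word, fit on frames pooled from
    # the training tracks positively associated with that word.
    # Standard EM is used here in place of weighted mixture hierarchies EM.
    models = {}
    for word in vocab:
        pos_frames = [frame_features(t)
                      for t, ann in zip(tracks, annotations) if word in ann]
        X = np.vstack(pos_frames)
        models[word] = GaussianMixture(n_components=n_components,
                                       covariance_type="diag",
                                       random_state=0).fit(X)
    return models

def annotate(track, models, top_k=2):
    # Annotation: score every word's GMM on the track's frames and
    # return the most likely words.
    X = frame_features(track)
    scores = {w: m.score(X) for w, m in models.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

def retrieve(query_word, tracks, models):
    # Retrieval: rank-order unannotated tracks by the query word's
    # average log-likelihood over each track's frames.
    m = models[query_word]
    return sorted(range(len(tracks)),
                  key=lambda i: m.score(frame_features(tracks[i])),
                  reverse=True)
```

For example, with synthetic "loud" (high-amplitude) and "quiet" (low-amplitude) tracks, `annotate` recovers the correct label and `retrieve("quiet", ...)` ranks the quiet tracks first, since each frame scores far higher under its own word's density.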