ON TECHNIQUES FOR CONTENT-BASED VISUAL ANNOTATION TO AID INTRA-TRACK MUSIC NAVIGATION

Gavin Wood
University of York
York YO10 5DD
United Kingdom
gav@cs.york.ac.uk

Simon O'Keefe
University of York
York YO10 5DD
United Kingdom
sok@cs.york.ac.uk

ABSTRACT

Although people are increasingly listening to music electronically, the core interface of the common tools for playing that music has seen very little improvement. In particular, the tools for intra-track navigation have remained essentially static, taking no advantage of recent studies in audio gisting, summarisation and segmentation. We introduce a novel mechanism for linear summarisation of musical audio and modify a widely used open-source media player to employ several music information retrieval techniques directly in the graphical user interface. Using a broad range of music, we provide a qualitative discussion of several techniques for content-based music information retrieval and a quantitative investigation into their usefulness.

1 INTRODUCTION

In recent years the techniques for content-based analysis of musical audio have improved dramatically. Moore's law continues steadily to provide software with ever-greater processing power, and the extra storage available for media means that we are able to store our entire music collection for digital playback. Graphical interfaces to media players have become more elaborate, and most mainstream software now supports some sort of visualisation of the music as it plays.[1]

In the original generation of graphical media players, a typical user interface feature was the "time bar". This allowed the user to see how far through the current track they were, relative to the length of the track.
The time bar was, in many ways, similar to a progress bar used to show the user how much of a particular task has been completed, with the exception that a time bar may also be used to navigate directly through a track: by clicking some way along it, the player resumes playing at the corresponding point in the track. As a navigation tool, however, its use is limited, because the user must know in advance approximately where on the bar the wanted moment of the track lies.

To improve the usefulness of this time bar, some extra information must be added to it, providing the user with visual cues. These allow the user to guess more accurately which point along the bar maps to the particular moment they are trying to find. Many studies (e.g., Bocker et al., 1986) have shown visual cues to be a simple and effective means of conveying information to the user without confusing a novice or distracting one already familiar. We call this visual annotation a "mood bar", referring to the varying shades used to depict the music content.

There is ample work in the field of IR on analysing and classifying individual segments of musical audio, perhaps with a view to archiving, retrieval, grouping, large-scale exploration and browsing, and extensive work has been carried out on forming the user interface to deal with this functionality. Comparatively little has been done for the interface once the necessary segment has been located and is ready to be played.

[1] Though in many cases the correspondence between audio and video leaves much to be desired.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. © 2005 Queen Mary, University of London
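To make the mood bar idea concrete, one way to produce "varying shades" from the audio content can be sketched as follows. The paper does not prescribe a particular feature mapping; the `mood_bar` helper below, its three frequency-band boundaries and its per-channel normalisation are all illustrative assumptions, mapping the low-, mid- and high-frequency energy of each time block to the red, green and blue channels of one colour cell.

```python
import numpy as np

def mood_bar(samples, sample_rate, n_blocks=200):
    """Sketch of a mood-bar colouring: one RGB cell per time block.

    The signal is split into n_blocks equal segments; the spectral
    energy in three coarse bands (low/mid/high) is mapped to the
    red, green and blue channels. Band edges are arbitrary choices.
    """
    block_len = len(samples) // n_blocks
    colours = np.zeros((n_blocks, 3))
    for i in range(n_blocks):
        block = samples[i * block_len:(i + 1) * block_len]
        spectrum = np.abs(np.fft.rfft(block))
        freqs = np.fft.rfftfreq(len(block), d=1.0 / sample_rate)
        bands = [(0, 300), (300, 2000), (2000, sample_rate / 2)]
        for channel, (lo, hi) in enumerate(bands):
            mask = (freqs >= lo) & (freqs < hi)
            colours[i, channel] = spectrum[mask].sum()
    # Normalise each channel to [0, 1] so the strip uses the full
    # colour range regardless of overall signal level.
    peak = colours.max(axis=0)
    peak[peak == 0] = 1.0
    return colours / peak
```

Each row of the returned array is one colour cell; drawn left to right, the cells form the strip that annotates the time bar, so a listener can associate visible colour changes with changes in the music.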
One might assume that once the user has their segment of data—be it a music track, a monolithic compilation (e.g. a live performance) or perhaps a radio broadcast—they are happy to have it play throughout. The concept of user-directed navigation is central to this work. We set out not to produce a visual annotation from which some (however small) absolute truth may be gained from a single sample. Instead we take a more holistic approach and free ourselves from the constraint that the annotation must mean something absolute and concrete. We allow our visualisation to take on any abstract form, and judge its performance by what, as humans, we are able to ascertain from the final depiction. We go on to measure how useful these depictions are for searching tasks across a broad range of music.

The task is an acid test: we give the participants an absolute minimum of learning time. As such, the results will heavily favour annotation methods with more obvious visual cues over those with more complex visual representations. This is deliberate, because we wish to test realistic casual usage; people should not have to suffer a significant learning curve to use a media player. In particular