Media Impact

Where Is the User in Multimedia Retrieval?

Cees Snoek, University of Amsterdam
Marcel Worring, University of Amsterdam, The Netherlands
Paul Sajda, Columbia University
Simone Santini, Universidad Autonoma de Madrid, Spain
David A. Shamma, Yahoo! Research
Alan F. Smeaton, Dublin City University, Ireland
Qiang Yang, Huawei Noah's Ark Lab, Hong Kong

Editor's Note: This article summarizes a recent panel discussion at the ACM International Conference on Multimedia Retrieval, where a case was made for making the interacting user a first-class citizen again in multimedia retrieval research.

Compared to information retrieval or computer vision, multimedia retrieval is a relatively young discipline. Many people mark the 1992 Visual Information Management Workshop1 as the beginning of the field. It was there that researchers recognized the need to consider multimedia data, in particular visual information, as a new type of item that could appear in a digital collection. Although the number of items at that time was still small, typically in the thousands, it was orders of magnitude larger than the tens of images that computer vision research was addressing. From that first workshop, we have this important quote: "Computer vision researchers will have to identify features required for interactive image understanding, rather than their discipline's current emphasis on automatic techniques, and develop techniques to compute these features in interactive environments."

The information retrieval field agreed that new techniques were necessary to cope with the specifics of visual data. The notion of visual words, now so popular in visual retrieval, was unheard of at that time. Thus, a new research area was born.

The IBM Query by Image Content (QBIC) system was released not long after the 1992 workshop. QBIC was an early example of a query-by-pictorial-example system, in which the user selected example images or otherwise specified the desired images. New possibilities for museum collections and medical imaging arose, but the techniques were not yet mature enough to have much impact. Various research efforts started to improve features, especially in terms of their invariance to varying conditions. This early period in content-based retrieval was mainly successful in bridging the sensory gap.3 At the end of that period (around 2000), two new conference series started.

The ACM Conference on Multimedia Information Retrieval (MIR), which began at the University of Illinois, originally focused on computer vision applications and held its 11th meeting in 2010. The first ACM Conference on Image and Video Retrieval (CIVR) in 2001 had a strong connection to the library sciences, which host a community of archivists who label data at insertion time and search for images on request. Until 2007, these conferences always included a few nontechnical papers every year that looked at physical retrieval techniques (such as labeling). The first VideOlympics,4 where interactive systems were demonstrated in front of a live audience of scientists and media librarians, was also held in 2007, at the Sound and Vision Archive in The Netherlands.

After 2007, these conferences shifted their focus toward the computational side of the problem, with a stronger emphasis on industrial applications. We can largely attribute this shift to the important role of TRECVID.2 Suddenly, there was easy access to datasets and clearly defined tasks and metrics. Hence, the field had a common goal to pursue and a benchmark to use. Interactive tasks were defined for TRECVID, but the concept-detection task especially flourished. It gave a boost not only to the multimedia retrieval field; the computer vision community started to embrace the topic as well.

In the early days, we could do what we wanted because we were alone and could easily cater to the task at hand. Then the Internet came

1070-986X/12/$31.00 © 2012 IEEE. Published by the IEEE Computer Society.