The Community of Multimedia Agents

Gang Wei, Valery A. Petrushin and Anatole V. Gershman
Accenture Technology Labs
161 N. Clark Street, Chicago, IL 60601
{gang.wei, valery.a.petrushin, anatole.v.gershman}@accenture.com

Abstract. Multimedia data mining requires the ability to automatically analyze and understand multimedia content. The Community of Multimedia Agents project is devoted to creating a community of researchers and students who are interested in developing multimedia annotation algorithms. It provides an open environment for developing, testing, learning and prototyping multimedia content analysis and annotation methods. It serves as a medium for researchers to contribute and share their achievements while protecting their proprietary techniques. Each method is represented as an agent that can communicate with the other agents registered in the environment using templates that are based on the descriptors and description schemes of the MPEG-7 standard. Using the standard allows agents developed by different organizations to operate and communicate with each other seamlessly, regardless of their programming languages and internal architectures. A development environment is provided to facilitate the construction of media analysis methods. The tool contains a workbench, which allows the user to integrate agents into more sophisticated systems, and a blackboard browser, which visualizes the processing results. It enables researchers to compare the performance of different agents and to combine them to rapidly prototype more powerful and robust systems. The Community can also serve as a learning environment in which researchers and students acquire and exchange cutting-edge multimedia analysis algorithms.

1 Introduction

The extraction of information from multimedia data has become vitally important with the explosive growth of digitized image, audio and video data.
It requires the ability to automatically analyze, understand and annotate multimedia content. A large number of techniques have been proposed in this area, ranging from simple measures, such as color histograms for images and energy estimates for audio signals, to more sophisticated systems, such as speaker emotion recognition in audio [1], automatic summarization of TV programs [2] and topic detection and tracking using audio transcripts [3]. However, the capabilities of current techniques still fall far short of the requirements of practical applications, especially in terms of intelligence and robustness. For example, even the most advanced face recognition algorithms can easily be fooled by a little makeup or by changes in the environment. We believe that the reliable understanding