Meeting Recording System via Multimodal Sensing

Shingo Tokunaga, Yoshimichi Ito, Naoko Nitta and Noboru Babaguchi
Graduate School of Engineering, Osaka University, 2-1 Yamada-oka Suita Osaka 565-0871, JAPAN

Abstract. In this paper, we propose a recording system for round-table meetings that uses a microphone array together with an omnidirectional video camera. Both devices are placed at the center of the table and record all activity during a meeting. They are also used to estimate the directions of speakers, and these estimates are exploited to reproduce the frontal image of the speaker from the omnidirectional image. Experimental results obtained with a prototype system are also shown.

1 Introduction

Usually, the results and progress of a meeting, such as what was decided and who said what, are summarized in a meeting log and recorded on paper-based media. Paper-based media are useful for their simplicity. However, they are limited when we want a precise record of the vivid activity of human communication and of the environment during a meeting, such as the state of the discussion, the behavior of speakers (smiling, angry, confused, excited, etc.) and that of other participants (nodding, agreeing, disagreeing, sleeping, etc.).

Recently, in order to record such human activity during a meeting, intelligent log systems called multimedia log systems have been proposed [1],[2],[3],[4],[5]. A system for recording lectures has also been proposed and is used for distance lectures [6]. The multimedia log consists of two types of data: data in which the contents of the speakers' utterances are structured, and the audio/video data at each event during the meeting. By linking these two types of data, a log user can observe the expressions and appearances of speakers together with the contents of the meeting.

The meeting environment of our concern is a round-table meeting with a few participants.
In implementing a meeting recording system, the following problems due to the sensing environment arise: a large-scale equipment setup is required, even for a small meeting, to sense the appearance of each participant; and it is difficult to capture a frontal view image of each speaker, since the video cameras are located behind the participants to avoid disturbing the meeting. Therefore, it is very important to construct a meeting recording system free from these sensing environment problems.

In order to overcome the above difficulties, we propose a meeting recording system using an omnidirectional video camera [7],[8],[9] and a microphone array [10]. They are located at the center of the table and record all the activities during a meeting. They are also used for estimating the directions of speakers, and the estimates are exploited for reproducing the frontal view image of the speaker with high fidelity.

The idea of using an omnidirectional video camera and a microphone array has also been proposed in [2] and [5]. However, those systems use only audio data for speaker localization, and their localization methods have not yet been evaluated experimentally. What distinguishes our localization method from theirs is that we use not only the audio data but also the video data for speaker localization. We expect that such multimodal sensing enables us to estimate the directions of speakers more accurately.

In this paper, we focus on speaker localization using multimodal sensing and on the experimental evaluation of our method. The topic of making a meeting log is omitted because of space limitations.

2 Speaker localization via multimodal sensing

The data flow of speaker localization using multimodal sensing and the process for making the frontal view images of speakers are shown in Fig. 1. The audio data is recorded by the microphone array system and is used for estimating the location of a speaker.
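To make the audio step concrete, the following is a minimal sketch of how a speaker direction can be estimated from a single pair of microphones by cross-correlation. It is a generic illustration, not the implementation used in the prototype: the function names, the far-field (plane-wave) assumption, the microphone spacing, the sampling rate and the speed of sound are all assumptions introduced here.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def estimate_delay(sig_a, sig_b, fs):
    """Estimate how many seconds sig_b lags behind sig_a.

    The lag is taken at the peak of the full cross-correlation of the
    two channels (hypothetical helper, not from the paper).
    """
    corr = np.correlate(sig_b, sig_a, mode="full")
    # Index 0 of `corr` corresponds to a lag of -(len(sig_a) - 1) samples.
    lag = int(np.argmax(corr)) - (len(sig_a) - 1)
    return lag / fs


def delay_to_angle(delay, mic_distance):
    """Convert an inter-microphone delay to an arrival angle in degrees.

    Assumes a far-field source, so that sin(theta) = c * delay / d.
    The ratio is clipped to [-1, 1] to tolerate small estimation errors.
    """
    ratio = np.clip(SPEED_OF_SOUND * delay / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(ratio)))


if __name__ == "__main__":
    fs = 16000                      # assumed sampling rate in Hz
    rng = np.random.default_rng(0)
    source = rng.standard_normal(2000)
    # Simulate the second microphone receiving the signal 5 samples later.
    delayed = np.concatenate([np.zeros(5), source[:-5]])
    tau = estimate_delay(source, delayed, fs)
    print("estimated delay [s]:", tau)
    print("estimated angle [deg]:", delay_to_angle(tau, mic_distance=0.2))
```

With more than two microphones, the same pairwise estimate could be repeated for several pairs and the results combined. How the prototype combines them is not specified in this section.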
As the algorithm for speaker localization, we use the cross-correlation method. The audio data is also used for making the contents of a meeting log, and is stored as structured data that records the contents of the speakers' utterances. The video data is captured by an omnidirectional video camera with a hyperboloidal mirror [7]. The omnidirectional video camera can capture a 360-degree horizontal field of view at once, and can be applied to the analysis of human behavior and to surveillance systems [11]. Since