COLLECTING AND SHARING BILINGUAL SPONTANEOUS SPEECH CORPORA: THE CHINFADIAL EXPERIMENT

Georges FAFIOTTE*, Christian BOITET*, Mark SELIGMAN*, ZONG Chengqing**

*GETA, CLIPS, IMAG-campus (UJF - Grenoble 1)
385 rue de la Bibliothèque, BP 53, F-38041 Grenoble Cedex 9 (France)

**National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
P.O. Box 2728, Beijing 100080 (China)

{georges.fafiotte, christian.boitet}@imag.fr, mark.seligman@spokentranslation.com, cqzong@nlpr.ia.ac.cn

ABSTRACT

We describe here the three main platforms in the ERIM family of Web-based environments for human interpreting: ERIM-Interp and ERIM-Collect in some detail, then ERIM-Aid more briefly. Each platform supports an aspect of the collection or study of spontaneous bilingual dialogues translated by a human interpreter. ERIM-Interp is the core environment, providing mediated communication between speakers and human interpreters over the network. Using ERIM-Collect, French-Chinese interpreting data have been collected within the three-year "ChinFaDial" project supported by LIAMA, a French-Chinese laboratory in Beijing. These "raw" speech data will be made available in the spring of 2004 on an open-access basis, through the DistribDial server on a CLIPS-GETA website. Our goal is to extend such corpora collaboratively, allowing other research groups to contribute to the site whatever annotations they may have created, and to share them under the same conditions (GPL). An ERIM-Aid variant is intended to provide focused machine aids to Web-based human interpreters, or to monolingual distant speakers conversing in different languages.

KEYWORDS

data collection, spontaneous speech, dialogue, speech corpora, interpreter, interpreting, free distribution, freeware

INTRODUCTION

One of our ultimate research goals is to build systems for automatic speech interpretation (translation of speech) over the Web. Much progress has been made in this area over the past ten years.
NEC produced the first speech translation demo, in the tourist domain, in September 1992, but the most widely known coordinated research efforts to date include the C-STAR projects (international Consortium for Speech Translation Advanced Research) [6], the European NESPOLE! IST project [9], the German Verbmobil project [10], and the US DARPA Communicator program [8] with the Galaxy Communicator Software Infrastructure. All have demonstrated platforms enhancing spontaneous speech processing in multilingual person-person or person-system communication, always in restricted domains.

At the same time, we are convinced that human interpreters will remain vital, both as irreplaceable suppliers of subtle nuances and as models for automatic systems. Human interpreting, too, will inevitably be carried out through the Web or its successors. Thus we foresee a continuing need for research on Web-based interpreting, and for the collection of data from realistic Web-based interpreting sessions (see Furuse et al. [4] for related data collection efforts). We expect the collected data to be useful for training or tuning automatic speech translation systems. They can also be used to study dialogue phenomena in order to adapt software components, lexicons, etc., to dialogue situations.

Unfortunately, the human resources required to collect such data are always scarce. We therefore see a need to recruit data contributors and processors from the world at large, following the open-source model. We aim to induce volunteer interpreters and students of interpretation to translate bilingual dialogues online, by exchanging this on-line help for free use of our Web-based lab for e-learning of the interpretation trade.

The ERIM human and automatic speech translation platforms have been implemented in several variants. In Section I, we describe the motivation and design of the ERIM-Interp base platform.
In Section II, we present ERIM-Collect, an extension of ERIM-Interp dedicated to the collection of interpreting data. In Section III, we describe the ChinFaDial data collected so far, soon to be accessible on a free-distribution web site. In Section IV, we sketch current developments and plans to "consolidate" the platform and to add new plugins extending it to the area of instruction or training for interpreters, an extension which we hope will in turn lead to the collection of new data. Conclusions follow.

I. ERIM-INTERP, FOR HUMAN INTERPRETATION ON THE WEB

From our previous work with the multimodal Wizard of Oz speech translation platform EMMI at ATR-ITL [5], from other work on monolingual multi-Wizard architectures (NEIMO [2]), and from experience gained in our lab with the C-STAR II and Nespole! projects, we concluded that, even with high-quality automatic interpreting systems, there should in any case be a real human "warm body" or "guardian angel" in the loop. Thus a realistic design for online network-based interpretation should "integrate" both human and machine interpretation. The ERIM platforms have been developed on this basis (ERIM is a French acronym for Network-based Environment for Multimodal Interpreting).

1. Motivation

Some companies have already developed proprietary network-oriented "interpreter's cubicles", which are the counterparts of existing fixed installations for interpreting