EXPERIMENTS IN AUTOMATIC MEETING TRANSCRIPTION USING JRTK

Hua Yu, Cortis Clark, Robert Malkin, Alex Waibel
Interactive Systems Laboratories
Carnegie Mellon University, Pittsburgh, PA, USA
Email: {hyu, cortis, malkin, ahw}@cs.cmu.edu

ABSTRACT

In this paper we describe our early exploration of automatic recognition of conversational speech in meetings, for use in automatic summarizers and browsers that produce meeting minutes effectively and rapidly. To achieve optimal performance, we started from two different baseline English recognizers, adapted them to meeting conditions, and tested the resulting performance. The data was found to be highly disfluent (conversational human-to-human speech), noisy (due to lapel microphones and the environment), and overlapped with background noise, resulting in error rates comparable so far to those on the CallHome conversational database (40-50% WER). A meeting browser is presented that allows the user to search and skim through highlights from a meeting efficiently despite the recognition errors.

1. INTRODUCTION

Meetings, seminars, lectures and discussions represent verbal forms of information exchange that frequently need to be retrieved and reviewed later on. Human-produced minutes typically provide a means for such retrieval, but they are costly to produce and tend to be distorted by the personal bias of the minute taker or reporter. To allow rapid access to the main points and positions in human conversational discussions and presentations, we are developing a meeting browser which records, transcribes and compiles highlights from a meeting or discussion into a condensed summary. The early experiments described here report on the particular problem of recognizing conversational speech in meetings and on the user interface of a meeting browser for later presentation.

We have recorded discussions of three or more participants.
To minimize interference with normal styles of speech, we ruled out the use of close-talking microphones and recorded meetings with lapel microphones on two or more speakers. The resulting speech was found to be highly disfluent, similar to the spoken telephone conversations in the Switchboard and CallHome databases, and it includes many rare words and/or unusual language. The signal quality is further degraded by crosstalk between speakers and by reverberation and echo due to the use of omnidirectional lapel microphones.

2. MEETING TRANSCRIPTION EXPERIMENTS

Unlike other speech recognition tasks, our particular goal in this task is to improve the performance of existing recognizers on the meeting data, with NO additional training data. As it is not obvious which existing recognizer to start from, we tried both a dictation system (the Wall Street Journal system, WSJ) and a spontaneous speech system (the English Spontaneous Scheduling Task system, ESST). We first introduce the test data in Section 2.1 and describe our experiments in detail in Section 2.2; Section 2.3 contains the results and discussion.

2.1. Testing Data

The test data was collected in an internal group meeting. Three lapel microphones were given to 3 of the 10 participants. The meeting was approximately 1 hour in length, giving us 3 hours of speech to test. The 3 speakers consist of 2 females (referred to as flsl and fdmg) and 1 male (referred to as maxl). The advantage of using a lapel microphone is that the speaker can wear it in a pocket, so it is not as intrusive as a close-talking microphone. The disadvantage is, of course, degraded sound quality. Since the microphone is not unidirectional, there is significant channel mixing. The recordings also contain a great deal of crosstalk, laughter, electrical humming, paper-scratching noise, etc.

2.2. System Specification

Our system is built upon the Janus Recognition Toolkit (JRTk), which is summarized in [1].
Incorporated into our continuous HMM system are techniques such as linear discriminant analysis (LDA) for feature space dimension reduction, vocal tract length normalization (VTLN) for speaker normalization, cepstral mean normalization (CMN) for channel normalization, and wide-context phone modeling (Polyphone
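Of the normalization techniques listed above, cepstral mean normalization is the simplest to illustrate: a stationary channel filter is multiplicative in the spectrum and hence additive in the cepstrum, so subtracting the long-term cepstral mean removes the stationary channel component. The following is not JRTk code, just a minimal standalone sketch of per-utterance CMN with made-up feature values:

```python
# Minimal sketch of cepstral mean normalization (CMN); pure Python,
# no external dependencies. Feature values are illustrative only.

def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over all frames.

    `frames` is a list of cepstral feature vectors (one per time frame);
    the returned features have zero mean in each coefficient, which
    removes a constant channel offset in the cepstral domain.
    """
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    means = [sum(f[i] for f in frames) / n_frames for i in range(n_coeffs)]
    return [[f[i] - means[i] for i in range(n_coeffs)] for f in frames]

# Toy example: 3 frames of 2 cepstral coefficients.
feats = [[1.0, 2.0], [3.0, 4.0], [2.0, 3.0]]
normalized = cepstral_mean_normalize(feats)
print(normalized[0])  # [-1.0, -1.0]
```

In practice the mean would be accumulated per speaker or per channel rather than per short utterance, but the subtraction step is the same.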