Presented at the 1998 ESCA Conference on Speech Technology in Language Learning, Marholmen, Sweden

Is Automatic Speech Recognition Ready for Non-Native Speech? A Data Collection Effort and Initial Experiments in Modeling Conversational Hispanic English

William Byrne, Eva Knodt, Sanjeev Khudanpur, Jared Bernstein

Center for Language and Speech Processing, The Johns Hopkins University, Baltimore, MD, USA
Entropic Research Laboratory, Inc., Menlo Park, CA, USA
Ordinate Corporation, Menlo Park, CA, USA

byrne@jhu.edu, knodt@entropic.com, khudanpur@jhu.edu, jared@ordinate.com

Abstract

We describe the protocol used for collecting a corpus of conversational English speech from non-native speakers at several levels of proficiency, and report the results of preliminary automatic speech recognition (ASR) experiments on this corpus using HTK-based ASR systems. The speech corpus contains both read and conversational speech recorded simultaneously on wide-band and telephone channels, and has detailed time-aligned transcriptions. The immediate goal of the ASR experiments is to assess the difficulty of the ASR problem in language learning exercises and thus to gauge how current ASR technology may be used in conversational computer-assisted language learning (CALL) systems. The long-term goal of this research, of which the data collection and experiments are a first step, is to incorporate ASR into computer-based conversational language instruction systems.

1 Introduction

While automatic speech recognition (ASR) has matured so that large-vocabulary, speaker-dependent dictation is commercially feasible, non-native accents, disfluent speech, and conversational dialogue pose substantial difficulties for ASR systems. To support speech recognition research on conversational and non-native speech, we implemented a protocol for collecting spontaneous conversations by Hispanic speakers of English.
The resulting corpus possesses a set of unique features that make it valuable for advanced speech recognition research on the linguistic characteristics of language learners:

- Conversations are spontaneous and goal-oriented, covering a broad range of grammatical structures and pragmatic tasks.
- Detailed, time-aligned transcriptions identify mispronunciation, hesitations, and other characteristics of non-native English.
- Recordings are made simultaneously on four channels (wide-band and telephone speech).
- English and Spanish read text is available for all subjects.

Initial experiments suggest that the speech in this database is significantly more difficult to recognize than conversations between native English speakers. We expected that the constrained, task-directed nature of the conversational topics would simplify the language modeling task and compensate for the poor acoustic modeling of non-native speech. However, this appears not to be the case. The vocabulary coverage of the material by existing native English conversational corpora is good except for occurrences of proper names, task-specific terms, and lapses into Spanish; however, the effectiveness of language models built on these native English corpora, as measured by perplexity, is poor. This appears to be due in part to the language learners' difficulties with English, and also to the presence of a fair amount of free conversation unrelated to the given tasks.

2 Database Overview

The Hispanic-English database covers two different types of speech: wide-band recordings of read speech, and four-channel, simultaneous wide-band and telephone-channel recordings of spontaneous conversational speech.

2.1 Speaker Demographics

The Hispanic-English speech corpus comprises approximately 20 hours of closely transcribed, spontaneous, conversational speech data from 11 speaker pairs, plus an additional 100 Spanish and English sentences read by each speaker.
Participating subjects were paid and were recruited from the Hispanic community local to Palo Alto, California. All were adult native speakers of Spanish as spoken in South and Central America. The selection criteria were a minimum of one year of residence in the US and a basic ability to understand, speak, and read English.

As part of the recruiting process, the subjects' proficiency in English was tested. We used a telephone-based, automated English proficiency test developed by Ordinate Corporation [1]. The test measures the ability of the test taker to comprehend and produce (or reproduce) spoken US English at a normal conversational speaking rate. Table 1 provides a breakdown of subjects according to gender, geographical origin, and test scores. Figure 2.1 plots