HUMAN LANGUAGE TECHNOLOGY: OPPORTUNITIES AND CHALLENGES

Mari Ostendorf 1   Elizabeth Shriberg 2,3   Andreas Stolcke 2,3
1 University of Washington   2 SRI International   3 International Computer Science Institute
mo@ee.washington.edu   {ees,stolcke}@speech.sri.com

ABSTRACT

In recent years, there has been dramatic progress in both speech and language processing, in many cases leveraging some of the same underlying methods. This progress and the growing technical ties motivate efforts to combine speech and language technologies in spoken document processing applications. This paper outlines some of the issues involved, as well as the opportunities, presenting an overview of the special double session on this topic.

1. INTRODUCTION

Human language technology (HLT) provides important tools for making use of the vast amount of information in documents available via the web, and significant recent progress has been made in areas such as text retrieval, analysis, summarization, and translation. While much of this work has focused on text documents, speech and video signals are also increasingly available. We refer to such signals – including TV and radio broadcasts, congressional records, oral histories, voicemail, call center recordings, etc. – as “spoken documents”. As speech recognition technology improves, language processing for spoken audio has attracted increased interest. And because it takes longer to listen to audio than to read text, spoken documents are clearly a prime candidate for automatic indexing, information extraction, and other such technologies.

Over the last decade, the speech processing and natural language processing communities have developed largely independently, though many of their algorithms stem from the same fundamental theory. With the growing importance of spoken document processing, there is now a need to bridge this gap.
This session takes a step towards this goal, by introducing speech researchers to downstream applications that could be applied to speech (and video), and by providing language processing researchers with insights into what speech has to offer beyond word information.

Many of the papers in this session raise issues in applying text-based technologies to spoken documents. The differences between written and spoken documents have implications for both speech and language processing modules. In addition, since HLT is ultimately needed for human processing of information, we include two papers assessing the impact of technology on human performance in various information processing tasks.

Despite the limited interaction between the speech and language processing communities, there has been some technology exchange through work in dialog systems, and both communities are leveraging advances in machine learning. Hence, we expect the session will also bring to light a wealth of shared algorithmic methods that could be useful in both fields, and where cross-fertilization is likely to provide mutual benefits. A few such shared techniques are highlighted in this overview; we encourage our readers to search for more examples in the papers in this session.

The goal of this paper is to set the context for the session, providing background on the various technologies and raising issues that cut across the relevant fields. In Section 2 we give an overview of the state of the art in large vocabulary speech recognition, to provide perspective on what might be available for spoken document processing. In Sections 3 and 4, we outline issues in speech processing that impact language processing, including information beyond the words and methods for handling speech recognition errors. In Section 5 we discuss some common threads in the methods used in speech and language processing.
Finally, in Section 6 we provide an overview of the eleven invited papers in this special double session.

2. LARGE VOCABULARY SPEECH RECOGNITION

Most HLT applications require the ability to accurately transcribe unrestricted, open-vocabulary speech. Since the mid-1990s, progress in large vocabulary recognition has been driven by annual evaluations conducted by NIST for automatic transcription of broadcast news (BN), conversational telephone speech (CTS), and, recently, multi-party meetings [1]. Evaluation conditions have become more difficult over the years through the imposition of factors such as runtime limits, automatic segmentation requirements, and a broadening of data sources. Nevertheless, word error rates (WERs) have declined from around 30% for BN and above 50% for CTS to below 10% and 15%, respectively. These improvements are due in part to the availability of increasing amounts of training data, which now comprise more than 2000 hours for both English BN and CTS. But there have been many research achievements as well, including techniques that make use of cheaper and therefore larger data sources (e.g., training on errorful transcriptions). In addition, the availability of more data has spawned the development of more sophisticated models. The systems have achieved remarkable convergence, across both sites and domains. In the paragraphs below, we overview key elements typically found in the NIST-evaluated systems.

Front ends use cepstral analysis in combination with dimensionality reduction techniques, such as heteroscedastic LDA, starting from up to third-order delta features or the concatenated cepstral vectors from several adjacent frames. Recent developments include discriminatively trained feature extraction methods such as fMPE [2] and multi-layer perceptrons [3]. A host of techniques are used to reduce mismatch between trained models and test data, and to reduce inter-speaker variability in training.
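To illustrate one ingredient of the front ends described above: delta features are conventionally computed by a linear regression over neighboring cepstral frames. The following is a minimal sketch of that standard regression formula in plain Python (the function name and edge-replication choice are our own illustration, not the front end of any particular evaluated system):

```python
def delta(frames, window=2):
    """Regression-based delta features over a list of per-frame
    feature vectors: d_t = sum_k k*(c_{t+k} - c_{t-k}) / (2*sum_k k^2)."""
    T = len(frames)
    D = len(frames[0])
    denom = 2.0 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(T):
        vec = []
        for d in range(D):
            num = 0.0
            for k in range(1, window + 1):
                # Replicate edge frames at the utterance boundaries.
                right = frames[min(t + k, T - 1)][d]
                left = frames[max(t - k, 0)][d]
                num += k * (right - left)
            vec.append(num / denom)
        out.append(vec)
    return out
```

Second-order (delta-delta) and third-order features, as mentioned above, are obtained by applying the same regression to the previous order's output; the resulting vectors are then stacked with the static cepstra before dimensionality reduction.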
Standard techniques include vocal-tract length normalization, adaptation of acoustic models using maximum likelihood linear regression (MLLR), and speaker-adaptive training based on MLLR. The acoustic models are mixtures of Gaussians, typically with several hundred thousand to a million distributions with diagonal covariances; recently