OPEN MIND SPEECH RECOGNITION

Jean-Marc Valin
Université de Sherbrooke
2500 boulevard de l’Université
Sherbrooke, Québec J1K 2R1 CANADA
valj01@gel.usherb.ca

David G. Stork *
Ricoh Silicon Valley
2882 Sand Hill Road #115
Menlo Park, CA 94025-7022 USA
stork@OpenMind.org

ABSTRACT

We describe speech research through the Open Mind Initiative, which provides a framework for large-scale collaborative efforts in building components of “intelligent” systems using the internet. Based on Open Source methodology, the Open Mind Initiative allows domain specialists to contribute algorithms, tool developers to provide software infrastructure and tools, and non-specialist “e-citizens” to contribute training data and information to large databases. An important challenge is to make it easy and rewarding for e-citizens to provide such information. We describe the current status of such speech research, and several challenges and opportunities associated with the Open Mind Initiative.

1. INTRODUCTION

The development of classifiers and “intelligent” machines (ones that can understand speech, summarize stories, engage in conversation, etc.) relies on both theory [6] and data, and there has been incremental improvement in a number of areas. Most subdisciplines in speech analysis and recognition require large corpora for progress; many would profit from an open framework for experimentation and collaboration. We discuss a new methodology, the Open Mind Initiative, to this end.

The paper is organized as follows: In Sect. 2 we stress the need for large corpora and an open framework for systems engineering and integration for further progress in several problems in speech processing and recognition. We then discuss the Open Mind Initiative, in particular its three components (domain experts, tool developers and non-specialist “e-citizens”), and briefly compare and contrast it with traditional Open Source. Then, in Sect.
3 we present current work on developing speech recognition systems that could be used with the Open Source operating system Linux. Section 4 mentions some unsolved problems, research directions and conclusions.

* Presenting author

2. THE OPEN MIND INITIATIVE

In very broad terms, recent work in many areas of pattern recognition and artificial intelligence has relied more and more upon fairly general models, such as powerful statistical ones, trained with a great deal of data. The fundamental theoretical underpinnings of domain-independent pattern recognition (maximum-likelihood and Bayesian techniques, function estimation, and so on) are highly developed and rigorous. While there will continue to be effort and progress, the foundations as currently understood are sufficient for developing successful pattern classifiers in many domains. The adequacy of even very simple models is illustrated in optical character recognition, where recognizers based on simple models (decision trees, neural networks, ...) trained with millions of characters outperform recognizers based on sophisticated models trained with less data [7]. This need for large training sets is a lesson that recurs in a number of domains, including acoustic speech recognition [9], speechreading [13], natural language processing [5], speech production [15], and others. For many areas where we may not yet have adequate models, we nevertheless know how to broaden and improve classes of models (to include more degrees of freedom to account for sources of variation, to set parameters, and so on) given enough data. In summary, then, it appears that in many interesting domains, particularly speech, large data sets are necessary.

The appreciation of the need for large knowledge bases and training data has led to the construction of publicly available databases.
The National Institute of Standards and Technology (NIST), the Linguistic Data Consortium (LDC), and others have compiled large databases of training data related to speech, language, documents and other domains. A representative example is that of the Macrophone project, compiled by Texas Instruments, a collection of roughly 200,000 utterances of free telephone speech from non-specialists, constrained by topic [2]. While these and other public databases have been vital to continued improvements in recognizers, some of the best systems are