A Multimodal Approach to Dictation of Handwritten Historical Documents

Vicent Alabau, Verónica Romero, Antonio-L. Lagarda, Carlos-D. Martínez-Hinarejos
Institut Tecnològic d'Informàtica, Universitat Politècnica de València
Camino de Vera, s/n, 46022, Valencia, Spain
{valabau,vromero,alagarda,cmartine}@iti.upv.es

Abstract

Handwritten Text Recognition is a problem that has gained attention in recent years due to the interest in the transcription of historical documents. Handwritten Text Recognition employs models similar to those used in Automatic Speech Recognition (Hidden Markov Models and n-grams). Dictation of the contents of the document is an alternative to text recognition. In this work, we compare the performance of a Handwritten Text Recognition system with that of two speech dictation systems: a non-multimodal system that uses only speech, and a multimodal system that performs an initial text recognition whose output is used in the subsequent speech recognition. Results show that the multimodal combination outperforms all of the other non-multimodal systems considered.

Index Terms: speech recognition, dictation, language modelling, handwritten text recognition

1. Introduction

In recent years, many online archives and digital libraries have been publishing large quantities of digitised legacy documents. These documents must be transcribed into an appropriate textual electronic format in order to allow text-based search of their contents and to provide historians and other researchers with new ways of indexing, consulting and querying them. However, the vast majority of these documents (hundreds of terabytes of digital image data) remain waiting to be transcribed into a textual electronic format. Therefore, manual transcription of these documents is an important task for making the contents of digital libraries available.
These transcriptions are usually carried out by experts in paleography, who are specialised in reading ancient scripts. These scripts are characterised by different handwritten/printed styles from diverse places and time periods. The time it takes an expert to transcribe one of these documents depends on their skills and experience, but most paleographers agree that each page requires several hours.

In this context, Handwritten Text Recognition (HTR) [1] has become an important research topic. HTR aims to obtain the word sequence contained in the image of a handwritten text line. This process requires a prior detection of the lines of text in an image, as well as some preprocessing steps to make the handwritten text more regular. The final result is a sequence of words (a transcription) of the text line, which may contain errors. When the error rate of the transcription is low enough, HTR can be a very useful tool for speeding up the transcription of handwritten text documents.

However, when paleographers are consulted on the most comfortable method to transcribe a handwritten text document, many of them claim that dictating the words is the best option. Consequently, Automatic Speech Recognition (ASR) systems are an important alternative to HTR systems. In addition, current state-of-the-art ASR and HTR systems share many features: Hidden Markov Models (HMMs) [2, 3] are used to model the basic elements of the signal (sounds for speech, strokes for handwritten text), and n-gram language models (LMs) are used to model word sequences [2]. From this viewpoint, HTR systems fit in the Natural Language Processing paradigm. Therefore, many features that are usual in ASR systems (such as the use of training data for HMMs and n-grams) are common to HTR systems as well.
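As a concrete illustration of the n-gram modelling shared by both modalities, the sketch below trains a toy add-alpha smoothed bigram LM and scores word sequences with it. This is not the paper's system; it is a minimal, self-contained example of the kind of word-sequence model P(w) that both ASR and HTR systems employ.

```python
from collections import defaultdict

def train_bigram_lm(corpus, alpha=1.0):
    """Train an add-alpha smoothed bigram LM from tokenised sentences."""
    unigram = defaultdict(int)
    bigram = defaultdict(int)
    vocab = set()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            unigram[prev] += 1
            bigram[(prev, cur)] += 1
    V = len(vocab)

    def prob(prev, cur):
        # Smoothed conditional probability P(cur | prev).
        return (bigram[(prev, cur)] + alpha) / (unigram[prev] + alpha * V)

    return prob

def sequence_prob(prob, sent):
    """P(w) for a tokenised sentence under the bigram LM."""
    tokens = ["<s>"] + sent + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= prob(prev, cur)
    return p

# Toy training data (hypothetical): word sequences seen in training.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
prob = train_bigram_lm(corpus)
```

A sequence consistent with the training data, such as "the cat sat", receives a higher probability than a scrambled one such as "cat the sat", which is exactly the preference that P(w) contributes during decoding.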
The similarities between the two types of systems make it possible to combine them easily into a multimodal system that may obtain a more reliable final hypothesis, since two different data sources (handwritten text and speech) can be used. In fact, previous attempts at combining handwritten input and speech input have been made [4], but most of them focus on the use of on-line handwritten text. In this work, we compare the use of speech dictation to transcribe handwritten text documents against the direct use of text recognition. Speech dictation is developed in non-multimodal (when only an ASR system is available) and multimodal (when both HTR and ASR systems are available) scenarios. We will show that using an initial HTR recognition makes it possible to restrict the set of ASR hypotheses and obtain better results than using only text recognition or plain speech dictation.

The paper is organised as follows: Section 2 describes the fundamentals of an HTR system, Section 3 explains the use of the HTR decoding to improve the ASR recognition, Section 4 summarises the experimental set-up, Section 5 shows the results, and Section 6 provides the main conclusions and future work lines in this field.

2. Handwritten text recognition

The HTR problem can be formulated as the problem of finding the most likely word sequence, $w = (w_1, w_2, \ldots, w_{|w|})$, for a given handwritten sentence image represented by a feature vector sequence, $x = (x_1, x_2, \ldots, x_{|x|})$:

$$\hat{w} = \operatorname*{argmax}_{w} P(w|x) = \operatorname*{argmax}_{w} P(x|w)\,P(w) \qquad (1)$$

$P(x|w)$ is typically approximated by concatenated character models, usually HMMs, and $P(w)$ is approximated by a word LM, usually n-grams [2]. HMMs are used in the same way as they are used in current ASR systems [3]. The most important differences lie in the type of input sequences of feature vectors: while in the case of ASR they represent acoustic data, the input sequences for off-line HTR are line-image features.
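The decoding rule of Eq. (1) can be sketched as follows. Over a small candidate list, each hypothesis is scored by the sum of its optical log-likelihood log P(x|w) (produced by the character HMMs) and its LM log-prior log P(w), and the best-scoring hypothesis is returned. All scores below are hypothetical toy numbers, not outputs of the paper's system.

```python
import math

def decode(candidates, lm_logprob):
    """Eq. (1): pick w maximising log P(x|w) + log P(w) over a candidate list.

    `candidates` maps each hypothesis (a tuple of words) to its optical
    log-likelihood log P(x|w), as would be produced by character HMMs.
    """
    return max(candidates, key=lambda w: candidates[w] + lm_logprob(w))

# Hypothetical scores: the LM favours the more plausible word sequence,
# overturning a slightly better optical score for the competing hypothesis.
candidates = {
    ("the", "old", "manor"): -10.2,   # log P(x|w)
    ("the", "old", "manner"): -9.8,
}
lm_scores = {
    ("the", "old", "manor"): math.log(0.02),    # log P(w)
    ("the", "old", "manner"): math.log(0.001),
}
best = decode(candidates, lambda w: lm_scores[w])
```

The same machinery suggests how the multimodal restriction works: if the candidate list is limited to the hypotheses produced by an initial HTR pass, the ASR decoder only has to discriminate among those, which is the intuition behind the combination evaluated in this paper.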
Figure 1 shows an example of how an HMM models two feature vector subsequences pertaining to the character "a".

Copyright 2011 ISCA. INTERSPEECH 2011, 28-31 August 2011, Florence, Italy, p. 2245. doi: 10.21437/Interspeech.2011-597