MULTI-FRAME COMBINATION FOR ROBUST VIDEOTEXT RECOGNITION

Rohit Prasad, Shirin Saleem, Ehry MacRostie, Prem Natarajan, Michael Decerbo

BBN Technologies, 10 Moulton Street, Cambridge, MA 02138, USA

ABSTRACT

Optical Character Recognition (OCR) of overlaid text in video streams is a challenging problem due to various factors, including the presence of dynamic backgrounds, color, and low resolution. In video feeds such as Broadcast News, a particular overlaid text region usually persists for multiple frames, during which the background may or may not vary. In this paper we explore two innovative techniques that exploit such multi-frame persistence of videotext. The first technique uses multiple instances to generate a single enhanced image for recognition. The second technique uses the NIST ROVER algorithm, developed for speech recognition, to combine 1-best hypotheses from different frames of a text region. Using ROVER yields a significant improvement in word error rate (WER) over recognizing a single instance. The WER is further reduced by combining hypotheses from frame instances generated using character models trained with different binarization thresholds. Overall, a 21% relative reduction in WER was achieved for multi-frame combination over decoding a single frame instance.

Index Terms— Optical Character Recognition, Hidden Markov Models, Videotext

1. INTRODUCTION

The huge explosion of multimedia content, especially in the form of archived or streaming video, has resulted in a critical need for indexing and archiving such content. While most of the information in video such as Broadcast News (BN) is in the visuals (faces, scenes, etc.) and the audio, unique information is often present in text form, either as overlaid text that describes the scene or as scene text that appears as part of the scene. A key step in indexing based on text in video is to recognize such text [1], [2].
In [2], we presented a hidden Markov model (HMM) based recognition system configured for recognizing overlaid text in BN videos. The core component of the system described in [2] is our script-independent Byblos Optical Character Recognition (OCR) engine [3], which was designed for machine-printed documents and has recently been applied to handwritten documents [4]. The emphasis in [2] was on customizing the Byblos OCR system for videotext recognition, with most of the customization focused on pre-processing of the text regions. Since videotext typically has low resolution, the first pre-processing step was to upsample the text images by a fixed factor. Next, because our OCR engine expects black-on-white binarized images, we also developed novel techniques for binarizing the color text images.

In this paper, we postulate that the multi-frame persistence of videotext can be exploited to mitigate the challenges posed by the varying characteristics of videotext across frame instances. In [2], we leveraged this multi-frame persistence to generate a single, "contrast enhanced" image for downstream processing. Here, we first compare recognition performance on the contrast-enhanced image against recognizing an empirically determined single best instance of a text region. Next, inspired by the system combination approaches [5][6] that have been used with significant success in speech recognition, we explore combining hypotheses from multiple instances of a particular text region. Specifically, we apply the Recognizer Output Voting Error Reduction (ROVER) algorithm [5] to combine 1-best hypotheses from multiple instances into a single hypothesis that is significantly better than any individual instance.
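The ROVER combination step can be sketched as follows. This is a deliberately simplified stand-in for the NIST implementation: it aligns each 1-best hypothesis against a growing word transition network with `difflib` rather than the full dynamic-programming alignment, and it uses plain majority voting with no confidence weighting. The example frame hypotheses are invented for illustration, not drawn from the paper's data.

```python
# Simplified ROVER-style combination of 1-best word sequences from
# several frame instances of the same text region.
from collections import Counter
from difflib import SequenceMatcher

def rover_combine(hyps):
    # Word transition network (WTN): one slot per aligned position,
    # each slot holding one word per hypothesis ("" is a null arc).
    wtn = [[w] for w in hyps[0]]
    for n, hyp in enumerate(hyps[1:], start=1):
        # Align the new hypothesis against a representative base string.
        base = [max(set(slot), key=slot.count) for slot in wtn]
        ops = SequenceMatcher(a=base, b=hyp, autojunk=False).get_opcodes()
        new_wtn = []
        for tag, i1, i2, j1, j2 in ops:
            if tag == "equal":
                for i, j in zip(range(i1, i2), range(j1, j2)):
                    new_wtn.append(wtn[i] + [hyp[j]])
            elif tag == "replace":
                # Pair up slots and words; pad the shorter side with nulls.
                for k in range(max(i2 - i1, j2 - j1)):
                    slot = wtn[i1 + k] if i1 + k < i2 else [""] * n
                    word = hyp[j1 + k] if j1 + k < j2 else ""
                    new_wtn.append(slot + [word])
            elif tag == "delete":
                for i in range(i1, i2):
                    new_wtn.append(wtn[i] + [""])
            else:  # insert
                for j in range(j1, j2):
                    new_wtn.append([""] * n + [hyp[j]])
        wtn = new_wtn
    # Majority vote in every slot; drop winning null arcs.
    voted = [Counter(slot).most_common(1)[0][0] for slot in wtn]
    return [w for w in voted if w]

frames = [
    "BREAKING NEVVS LIVE".split(),   # frame 1: OCR confuses W with VV
    "BREAKING NEWS LIVE".split(),    # frame 2: correct
    "BREAKING NEWS L1VE".split(),    # frame 3: I misread as 1
]
print(rover_combine(frames))  # → ['BREAKING', 'NEWS', 'LIVE']
```

As in the paper's experiments, no single frame is error-free, yet the voted hypothesis recovers the correct string because each error is outvoted by the other instances.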
Additional reductions in WER are achieved by: (a) incorporating the contrast-enhanced image in the hypothesis set for combination, and (b) using character models trained with different binarization thresholds to decode different instances of a text region.

2. OVERVIEW OF HMM BASED VIDEOTEXT OCR

Our videotext OCR system is a customized version of the HMM based BBN Byblos OCR system developed for recognizing text in printed documents. A pictorial representation of the BBN Byblos OCR system [3] is given in Figure 1. Knowledge sources are depicted by ellipses and are dependent on the particular language or script. The OCR system components themselves are identified by rectangular boxes and are independent of the particular language or script; thus, the same OCR system can be configured to perform recognition on any language.

The BBN Byblos OCR system can be subdivided into two basic functional components: training and recognition. Both share a common pre-processing and feature extraction stage, which begins by deskewing the scanned image and then locating the positions of the text lines on the deskewed image. Next, the feature extraction program computes a feature vector, which is a function of the horizontal position within the line. First, each line of text is

[Figure 1: Hidden Markov model based Byblos OCR engine.]