Lexicon-Driven Handwritten Character String Recognition for Japanese Address Reading

Cheng-Lin Liu, Masashi Koga, Hiromichi Fujisawa
Central Research Laboratory, Hitachi, Ltd.
1-280 Higashi-koigakubo, Kokubunji-shi, Tokyo 185-8601, Japan
{liucl, koga, fujisawa}@crl.hitachi.co.jp

Abstract

This paper proposes a handwritten character string recognition method for Japanese mail address reading with a very large vocabulary. Recognition is performed by classification-embedded lexicon matching based on over-segmentation. The lexicon contains 111,349 address phrases and is represented in a trie structure. In recognition, the input text line image is matched with all lexicon entries by beam search to obtain reliable character segmentation and retrieve valid phrases. A classifier is embedded in lexicon matching to select, from a dynamic character set, the characters that match a candidate pattern. The beam search and the character classification jointly enable accurate phrase identification in real time. In experiments on 3,589 live mail images, the proposed method achieved a correct rate of 83.68% with an error rate of less than 1%.

1. Introduction

Automatic reading of handwritten addresses is a challenging task in OCR applications. Address reading can be reduced to a character string recognition [1], [2] or word recognition problem [3], [4], which is difficult because the characters cannot be segmented reliably prior to recognition. For Chinese/Japanese handwriting, the difficulty of segmentation originates not only from character shape variation and touching, but also from the variability of internal/external gaps due to the multi-radical structure of characters.

To resolve this problem, it is necessary to integrate the classification results and linguistic knowledge into segmentation. Two general approaches to this are segmentation hypothesis-verification and lexicon-driven recognition.
Lexicon-driven recognition is distinguished in that, depending on the lexicon context, the candidate character patterns are generated dynamically and the number of characters to be matched with a candidate pattern is variable.

Even though the lexicon-driven approach has been widely adopted in word recognition, it received little attention in Chinese/Japanese character string recognition until the work of Koga et al. [2], which was successfully applied to printed address images. The proposed method extends that of Koga et al. in three aspects: lexicon size, touching pattern splitting, and search strategy.

Intended for mail dispatch, the lexicon contains 111,349 address phrases from all over Japan. The address phrases are arranged in a trie structure [3], [5], which facilitates phrase retrieval because the suffix partial strings preceded by a common prefix are explicitly clustered. It also circumvents the large character set problem because, in lexicon matching, a candidate pattern is compared only to the characters that can succeed a prefix partial string instead of all the characters in the classification dictionary.

In lexicon matching, we use a beam search strategy [6] to match the input text line image with all lexicon entries simultaneously. The text line is represented as a sequence of image segments after over-segmentation. The search space is represented in a tree structure. A high-accuracy character classifier is used to select characters from a dynamic character set to match a candidate pattern. The search space is further reduced by evaluating intermediate paths and pruning those with low scores. In addition, we introduce some efficient techniques for image pre-processing and pre-segmentation. The techniques apply to both horizontal writing and vertical writing.

2. Pre-segmentation

The input to this character string recognition subsystem is an address block image composed of text lines.
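As a rough illustration of the lexicon matching outlined in the Introduction, the following Python sketch builds a trie over a few address phrases and runs a beam search over an over-segmented line. It is a minimal sketch under stated assumptions, not the implementation used in this paper: the names `TrieNode`, `build_trie`, `toy_score`, `beam_search`, the `RADICALS` table, the additive scoring, and the parameters `beam_width` and `max_merge` are all hypothetical, and image segments are simulated by strings so that the example is self-contained.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)  # character -> TrieNode
    is_phrase_end: bool = False


def build_trie(phrases):
    """Insert every phrase so that suffixes sharing a prefix share a node."""
    root = TrieNode()
    for phrase in phrases:
        node = root
        for ch in phrase:
            node = node.children.setdefault(ch, TrieNode())
        node.is_phrase_end = True
    return root


# Hypothetical stand-in for over-segmentation: some characters are assumed
# to split into two primitive segments (their radicals).
RADICALS = {"明": ("日", "月"), "林": ("木", "木")}


def toy_score(pattern, ch):
    """Toy classifier (assumption): 1.0 if the merged segments form ch, else 0.0."""
    expected = RADICALS.get(ch, (ch,))
    return 1.0 if tuple(pattern) == expected else 0.0


def beam_search(segments, root, beam_width=5, max_merge=2):
    """Match a segment sequence against all lexicon entries at once.

    A hypothesis is (score, text, trie_node). beams[i] holds the hypotheses
    that have consumed exactly i segments.
    """
    beams = [[] for _ in range(len(segments) + 1)]
    beams[0] = [(0.0, "", root)]
    results = []
    for i in range(len(segments) + 1):
        # Prune: keep only the best-scoring hypotheses at this position.
        beams[i] = sorted(beams[i], key=lambda h: h[0], reverse=True)[:beam_width]
        for score, text, node in beams[i]:
            if node.is_phrase_end:
                results.append((score, text, i))
            # A candidate pattern merges 1..max_merge consecutive segments.
            for k in range(1, max_merge + 1):
                if i + k > len(segments):
                    break
                pattern = segments[i:i + k]
                # Dynamic character set: only the successors of this prefix.
                for ch, child in node.children.items():
                    s = toy_score(pattern, ch)
                    if s > 0.0:
                        beams[i + k].append((score + s, text + ch, child))
    return sorted(results, reverse=True)


lexicon = build_trie(["日立市", "明石市", "日野市"])
matches = beam_search(["日", "月", "石", "市"], lexicon)
```

In this toy run the first two segments can be read either as the single character 日 or merged into 明; only the second reading can be extended through the trie, so the segmentation is resolved by the lexicon context rather than fixed beforehand. A real system would replace `toy_score` with a trained classifier that returns graded similarity scores, which is where pruning by `beam_width` becomes significant.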
Each text line undergoes pre-processing to clean the image, pre-segmentation to generate primitive segments, and lexicon matching to search for valid address phrases. The mail address image is often contaminated by an overlapping postal stamp and other noise. Considering that the noise area usually has smaller stroke