Handwritten Document Image Analysis at Los Alamos: Script, Language, and Writer Identification

Judith Hochberg, Kevin Bowers, Michael Cannon, and Patrick Kelly
Mail Stop B265, Los Alamos National Laboratory, Los Alamos, NM 87545
{judithh, tmc, kelly}@lanl.gov
kbowers@eecs.berkeley.edu

Abstract

A system for automatically identifying the script used in a handwritten document image is described. The system was developed using a 496-document dataset representing six scripts, eight languages, and 281 writers. Documents were characterized by the mean, standard deviation, and skew of five connected component features. A linear discriminant analysis was used to classify new documents, and tested using writer-sensitive cross-validation. Classification accuracy averaged 88% across the six scripts. The same method, applied within the Roman subcorpus, discriminated English and German documents with 85% accuracy. Pilot results indicate that a variation of the method may be applicable to writer identification.

1. Introduction

Script and language identification are important parts of the automatic processing of document images in an international environment. A document's script (e.g., Cyrillic or Roman) must be known in order to choose an appropriate optical character recognition (OCR) algorithm. For scripts used by more than one language, knowing the language of a document prior to OCR is also helpful. And language identification is crucial for further processing steps such as routing, indexing, or translation.

For scripts such as Greek, which are used by only one language, script identification accomplishes language identification. For scripts such as Roman, which are used by many languages, it is normally assumed that script identification will take place first, followed by language identification within the script (e.g. [1]). Alternatively, it may be possible to skip script identification as an intermediate step, recognizing languages directly regardless of their script.

To the best of our knowledge, script identification has never been attempted for handwritten documents. Because of the dramatic individual differences in handwriting, we found a feature-based approach to be most successful, in contrast to the template matching we have previously applied to machine printed documents [2-3]. In the spirit of Wilensky et al. [4], each document was characterized by a single feature vector, containing summary statistics taken across the document's black connected components. The documents were then classified using linear discriminant analysis.

The main focus of this work was script identification: the method was 88% accurate in distinguishing among six scripts, including challenging pairs of related (and visually similar) scripts such as Roman/Cyrillic and Chinese/Japanese. We also took a first look at language identification within the Roman script: the method was 85% accurate for English versus German documents. Finally, we report promising pilot results (80% accuracy for a rough implementation) on a variation of our method applied to writer identification from free text.

2. Data

We assembled a corpus of 496 handwritten documents from six scripts: Arabic, Chinese, Cyrillic, Devanagari, Japanese, and Roman. The scripts are illustrated in Figure 1. For the most part, document images were obtained from foreign language speakers we were acquainted with or whom we contacted through the Internet. Over