Bag of characters and SOM clustering for script recognition and writer identification

Simone Marinai, Beatrice Miotti, Giovanni Soda
Dipartimento di Sistemi e Informatica
Università di Firenze
Firenze, Italy
Email: marinai@dsi.unifi.it

Abstract—In this paper, we describe a general approach for script (and language) recognition from printed documents and for writer identification in handwritten documents. The method is based on a bag of visual words strategy where the visual words correspond to characters and the clustering is obtained by means of Self Organizing Maps (SOM). Unknown pages (words in the case of script recognition) are classified by comparing their vectorial representations with those of a training set using the cosine similarity. The comparison is improved using a similarity score that takes into account the SOM organization of the cluster centroids. Promising results are presented for both printed documents and handwritten musical scores.

I. INTRODUCTION

The automatic identification of the language or script used in printed documents is useful for the automatic archiving of multi-language documents, for the choice of a suitable OCR engine, and for the indexing of documents in digital libraries. A related problem is writer identification in handwritten documents, which is useful to historians and in forensic practice.

According to [1], script recognition can be approached with local or global methods. Local analysis systems work in most cases at the character level. In [2] Zhou et al. propose a system for Bangla/English script identification based on the analysis of connected component profiles. In [3] Chanda et al. present a system for Sinhala, Tamil, and English script identification that is based on the extraction of character features derived from the water reservoir principle.

Systems based on a global approach envisage a preliminary segmentation to identify the text blocks.
A texture approach is proposed in [4] for the recognition of Latin, Greek, and Japanese scripts. Features based on Gabor energy and gray-level co-occurrence matrices are used to represent each text block. A system based on the texture of the document blocks is proposed in [5] for the writer identification of music scores. Tan et al. [6] approach the recognition of Arabic, Roman, and Tamil scripts in a collection of on-line handwritten documents by means of the vector space model and the tf-idf weighting scheme computed on features extracted at the line level.

In computer vision applications (e.g. [7]) some methods adopt the so-called “bag of visual words” model, where images (or document images) are represented by the occurrences of some “visual words” in the images. Visual words are identified on the basis of a visual dictionary that is obtained by clustering the feature vectors that describe local information on key-points in the images. Each cluster can be seen as an equivalence class of similar patterns whose occurrences characterize a given type of document. A similar approach has been considered for the indexing of graphical document images, taking into account bags of symbols, in [8].

In this paper we propose a local method and we follow a bag of visual words approach at the character (or symbol) level in order to address script recognition and writer identification. In particular, we propose to use a Self Organizing Map neural network for the clustering, to take advantage in the recognition of the topological organization of the clusters (which is guaranteed by the SOM structure).

The rest of the paper is organized as follows. In Section II we summarize the proposed approach, which is evaluated in the experiments described in Section III. Conclusions and future work are discussed in Section IV.

II. THE PROPOSED APPROACH

The method described in this paper involves four main steps.
First, characters and symbols are extracted from the images and represented with suitable feature vectors. Second, vector quantization is performed by clustering the feature vectors and then representing each vector by the index of the cluster it belongs to. Third, each page to be indexed is represented with weighted frequencies of symbols belonging to each cluster taking into account the tf-idf weighting scheme. Fourth, objects to be identified (words or pages) are represented similarly to the indexed pages and then classified with a k-nn classifier; the similarity is computed with the cosine of the angle between weight vectors, in analogy with the vector space model in Information Retrieval.

The character/symbol extraction is achieved by means of connected components that are computed in an image that is processed with morphological dilations so as to

2010 International Conference on Pattern Recognition 1051-4651/10 $26.00 © 2010 IEEE DOI 10.1109/ICPR.2010.534 2174