Bag of characters and SOM clustering for script recognition
and writer identification
Simone Marinai, Beatrice Miotti, Giovanni Soda
Dipartimento di Sistemi e Informatica
Università di Firenze
Firenze, Italy
Email: marinai@dsi.unifi.it
Abstract—In this paper, we describe a general approach for
script (and language) recognition from printed documents and
for writer identification in handwritten documents. The method
is based on a bag of visual words strategy where the visual words
correspond to characters and the clustering is obtained by
means of Self Organizing Maps (SOM). Unknown pages (words
in the case of script recognition) are classified by comparing their
vector representations with those of a training set using
cosine similarity. The comparison is improved using a similarity
score that is obtained by taking into account the SOM organization
of cluster centroids. Promising results are presented for both
printed documents and handwritten musical scores.
I. INTRODUCTION
The automatic identification of the language or script used
in printed documents is useful for the automatic archiving of
multi-language documents, for the choice of a suitable OCR
engine or for the indexing of documents in digital libraries.
A related problem is writer identification in handwritten
documents, which is useful to historians and in forensic
practice.
According to [1], script recognition can be approached
with local or global methods. Local analysis systems work
in most cases at the character level. In [2] Zhou et al.
propose a system for Bangla/English script identification
based on the analysis of connected component profiles. In
[3] Chanda et al. present a system for Sinhala, Tamil, and
English script identification that is based on the extraction of
character features derived from the water reservoir principle.
Systems based on a global approach require a preliminary
segmentation to identify the text blocks. A texture approach
is proposed in [4] for the recognition of Latin, Greek, and
Japanese scripts. Features based on Gabor Energy and gray-
level co-occurrence matrices are used to represent each text
block. A system based on the texture of the document
blocks is proposed in [5] for the writer identification of
music scores. Tan et al. [6] approach the recognition of
Arabic, Roman, and Tamil scripts in a collection of online
handwritten documents by means of the vector space
model and the tf-idf weighting scheme computed on features
extracted at the line level.
In computer vision applications (e.g., [7]), some methods
adopt the so-called “bag of visual words” model, where images
(or document images) are represented by the occurrences
of some “visual words” in the images. Visual
words are identified on the basis of a visual dictionary that
is obtained by clustering the feature vectors that describe
local information on key-points in the images. Each cluster
can be seen as an equivalence class of similar patterns
whose occurrences characterize a given type of document.
A similar approach has been considered for the indexing
of graphical document images taking into account bags of
symbols in [8].
In this paper we propose a local method that follows a
bag of visual words approach at the character (or symbol)
level to address script recognition and writer identification.
In particular, we propose to use a Self Organizing
Map neural network for the clustering, so that the recognition
can take advantage of the topological organization of the
clusters (which is provided by the SOM structure).
The rest of the paper is organized as follows. In Section
II we summarize the proposed approach, which is evaluated in
the experiments described in Section III. Conclusions and
future work are discussed in Section IV.
II. THE PROPOSED APPROACH
The method described in this paper involves four main
steps. First, characters and symbols are extracted from
the images and represented with suitable feature vectors.
Second, vector quantization is performed by clustering the
feature vectors and then representing each vector by the
index of the cluster it belongs to. Third, each page to be
indexed is represented by the weighted frequencies of the symbols
belonging to each cluster, using the tf-idf
weighting scheme. Fourth, objects to be identified (words
or pages) are represented in the same way as the indexed pages
and then classified with a k-NN classifier; the similarity is
computed as the cosine of the angle between weight vectors,
in analogy with the vector space model in Information
Retrieval.
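As an illustration only, steps two to four can be sketched in Python with NumPy as follows. The sketch assumes that cluster centroids are already available (for instance from a trained SOM) and that each page has been reduced to an array of symbol feature vectors. The function names, the particular tf-idf variant (relative term frequency multiplied by log(N/df)), the toy counts and the value of k are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter

import numpy as np


def quantize(features, centroids):
    """Vector quantization: index of the nearest centroid for each feature vector."""
    # Squared Euclidean distance between every feature and every centroid.
    dist = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)


def cluster_counts(cluster_ids, n_clusters):
    """Number of symbols of a page that fall in each cluster."""
    return np.bincount(cluster_ids, minlength=n_clusters).astype(float)


def idf_weights(train_counts):
    """Assumed idf variant: log(N / df), with df clipped to 1 to avoid division by zero."""
    n_pages = train_counts.shape[0]
    df = (train_counts > 0).sum(axis=0)
    return np.log(n_pages / np.maximum(df, 1))


def tfidf(counts, idf):
    """Relative cluster frequency weighted by idf (works on one page or a stack of pages)."""
    total = counts.sum(axis=-1, keepdims=True)
    return counts / np.maximum(total, 1.0) * idf


def cosine(a, b):
    """Cosine of the angle between two weight vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def knn_classify(query, train_vectors, train_labels, k=1):
    """Majority label among the k training vectors most similar to the query."""
    sims = np.array([cosine(query, v) for v in train_vectors])
    top = np.argsort(-sims)[:k]
    return Counter(train_labels[i] for i in top).most_common(1)[0][0]


# Toy training set: per-page cluster counts for three clusters (made-up numbers).
train_counts = np.array([[5, 1, 0], [4, 2, 0], [0, 1, 5], [0, 2, 4]], dtype=float)
train_labels = ["latin", "latin", "greek", "greek"]
idf = idf_weights(train_counts)
train_vectors = tfidf(train_counts, idf)

# Unknown page: four symbols described by 2-D features (hypothetical values).
centroids = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
page_features = np.array([[0.1, 0.2], [0.3, -0.1], [-0.2, 0.1], [5.2, 4.9]])
page_counts = cluster_counts(quantize(page_features, centroids), len(centroids))
predicted = knn_classify(tfidf(page_counts, idf), train_vectors, train_labels, k=3)
```

With this idf variant, a cluster that occurs in every training page receives weight zero, so only clusters that discriminate between classes contribute to the cosine similarity.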
The character/symbol extraction is achieved by means
of connected components that are computed in an image
that is processed with morphological dilations so as to
2010 International Conference on Pattern Recognition
1051-4651/10 $26.00 © 2010 IEEE
DOI 10.1109/ICPR.2010.534
2174