IJDAR (2012) 15:71–83
DOI 10.1007/s10032-011-0148-6
ORIGINAL PAPER
CMATERdb1: a database of unconstrained handwritten Bangla
and Bangla–English mixed script document image
Ram Sarkar · Nibaran Das · Subhadip Basu ·
Mahantapas Kundu · Mita Nasipuri ·
Dipak Kumar Basu
Received: 3 June 2010 / Revised: 18 August 2010 / Accepted: 24 January 2011 / Published online: 24 February 2011
© Springer-Verlag 2011
Electronic supplementary material: The online version of this
article (doi:10.1007/s10032-011-0148-6) contains supplementary
material, which is available to authorized users.
R. Sarkar · N. Das · S. Basu · M. Kundu · M. Nasipuri (✉)
Computer Science and Engineering Department,
Jadavpur University, Kolkata 700032, India
e-mail: nasipuri@vsnl.com
R. Sarkar
e-mail: raamsarkar@gmail.com
N. Das
e-mail: nibaran@gmail.com
S. Basu
e-mail: subhadip8@yahoo.com
M. Kundu
e-mail: mkundu@cse.jdvu.ac.in
D. K. Basu
A.I.C.T.E. Emeritus Fellow, Computer Science and Engineering
Department, Jadavpur University, Kolkata 700032, India
e-mail: dipakkbasu@gmail.com
Abstract In this paper, we describe the preparation of a
benchmark database for research on off-line Optical Character
Recognition (OCR) of document images of handwritten Bangla
text and Bangla text mixed with English words. This is the
first handwritten database in this area to be made available
as an open source resource. As India is a multi-lingual
country with a colonial past, multi-script document pages are
very common. The database contains 150 handwritten document
pages, of which 100 pages are written purely in Bangla script
and the remaining 50 pages are written in Bangla text mixed
with English words. The off-line handwritten pages were
collected from different data sources. After collection, all
the document pages were preprocessed and distributed into two
groups: CMATERdb1.1.1, containing document pages written in
Bangla script only, and CMATERdb1.2.1, containing document
pages written in Bangla text mixed with English words.
Finally, we also provide ground truth images for line
segmentation. To generate the ground truth images, we first
labeled each line in a document page automatically by
applying one of our previously developed line extraction
techniques [Khandelwal et al., PReMI 2009, pp. 369–374] and
then corrected any errors using our tool GT Gen 1.1. Line
extraction accuracies of 90.6% and 92.38% are achieved on the
two databases, respectively, using our algorithm. Both
databases, along with the ground truth annotations and the
ground truth generating tool, are freely available at
http://code.google.com/p/cmaterdb.
Keywords Unconstrained handwritten document image
database · Text line extraction · Ground truth preparation ·
OCR of multi-script document
1 Introduction
OCR involves computer recognition of characters from
digitized images of optically scanned document pages. The
characters thus recognized are encoded in the American
Standard Code for Information Interchange (ASCII) or some
other standard code and stored in a file, which can then be
edited like any other file created with word processing
software or a text editor. A scanner with an OCR facility
thus allows the contents of document pages to be edited after
they have been scanned optically.
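The encoding step described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function name `encode_recognized_text` is hypothetical, and UTF-8 is assumed as the "other standard code", since ASCII covers only 128 characters and cannot represent Bangla script.

```python
def encode_recognized_text(chars, encoding="utf-8"):
    """Join a sequence of recognized characters and encode it for storage.

    Hypothetical helper for illustration; UTF-8 is an assumed default.
    """
    return "".join(chars).encode(encoding)


# English characters fit in ASCII: one byte per character.
english = encode_recognized_text(["O", "C", "R"], encoding="ascii")

# Bangla characters (the word "বাংলা") need a Unicode encoding;
# in UTF-8 each code point in the Bengali block takes three bytes.
bangla_chars = ["\u09ac", "\u09be", "\u0982", "\u09b2", "\u09be"]
bangla = encode_recognized_text(bangla_chars)

# Attempting to store Bangla text as ASCII fails.
try:
    encode_recognized_text(bangla_chars, encoding="ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

For a mixed Bangla-English page, a single Unicode encoding such as UTF-8 lets both scripts be stored in the same editable file.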
Identification of text lines is the first and most important
step in the process of OCR of handwritten/printed document
images. If text line identification is not accurate, then none of
the words and characters in the constituent text lines can be