Himalayan Linguistics
Namsel: An optical character recognition system for Tibetan text
Zach Rowinski Kurt Keutzer
University of California, Berkeley
ABSTRACT
The use of advanced computational methods for the analysis of large corpora of electronic texts is becoming
increasingly popular in humanities and social science research. Unfortunately, Tibetan Studies has lacked
such a repository of electronic, searchable texts. The automated recognition of printed texts, known as
Optical Character Recognition (OCR), offers a solution to this problem; however, until recently, robust
OCR systems for the Tibetan language have not been available. In this paper, we introduce one such system,
called Namsel, which uses OCR to support the production, review, and
distribution of searchable Tibetan texts at a large scale. Namsel tackles a number of challenges specific to
the recognition of complex scripts such as Tibetan uchen and achieves high accuracy rates on
a wide range of machine-printed works. We discuss the details of Tibetan OCR, how Namsel
works, and the problems it is able to solve. We also discuss the collaborative work between Namsel and its
partner libraries aimed at building a comprehensive database of historical and modern Tibetan works—a
database that consists of more than one million pages of texts spanning over a thousand years of literary
production.
KEYWORDS
NLP, Tibetan, Optical Character Recognition, OCR
This is a contribution from Himalayan Linguistics, Vol. 15(1): 12–30.
ISSN 1544-7502
© 2016. All rights reserved.