Training TESSERACT Tool for Amazigh OCR

KHADIJA EL GAJOUI 1, FADOUA ATAA ALLAH 2, MOHAMMED OUMSIS 3
1 Laboratory of research in Informatics and Telecommunications, Faculty of Sciences – Rabat, Mohammed V University, Rabat, MOROCCO
2 CEISIC, The Royal Institute of Amazigh Culture, Rabat, MOROCCO
3 Department of Computer Science, School of Technology-Sale, Mohammed V University, Sale, MOROCCO
khadija.gajoui@gmail.com, ataaallah@ircam.ma, oumsis@yahoo.com

Abstract: - Optical character recognition (OCR) is the operation of converting an image of text into an editable text file. Several tools have been developed as OCR systems. The techniques used vary from one system to another, and so does the accuracy. In this paper, we present examples of available OCR tools, and we train the Tesseract tool on the Amazigh language transcribed in Latin characters.

Key-Words: OCR; Amazigh; Tesseract; Training.

1. Introduction
Over the last five decades, machine reading has grown from a dream to reality. Optical character recognition has become one of the most successful applications of technology. Many OCR systems exist for a variety of applications, although machines are still not able to compete with human reading capabilities.
The Amazigh language is spoken by a significant part of the population in North Africa. It has been an official language in Morocco since 2011. Yet few studies on OCR systems have addressed this language, whether written in the Tifinagh alphabet or transcribed in Arabic or Latin letters.
As research on optical character recognition, a field within pattern recognition, artificial intelligence and computer vision, has developed, various tools have been designed to convert text images into editable text with a fairly high recognition rate [1]. Depending on their license, OCR tools are either open source or commercial.
In the remainder of this paper, Section 2 defines the OCR system architecture and presents the different approaches developed for each module of the system. Section 3 introduces the Amazigh writing system. Section 4 presents examples of OCR tools. Section 5 lists the steps for training the Tesseract tool on the Amazigh language. Section 6 then reports the evaluation of the system, tested on a set of documents extracted from different books. Finally, Section 7 draws conclusions and suggests further related research.