Layout Analysis for Arabic Historical Document Images Using Machine Learning

Syed Saqib Bukhari*, Thomas M. Breuel
Technical University of Kaiserslautern, Germany
bukhari@informatik.uni-kl.de, tmb@informatik.uni-kl.de

Abedelkadir Asi*, Jihad El-Sana
Ben-Gurion University of the Negev, Israel
abedas@cs.bgu.ac.il, el-sana@cs.bgu.ac.il

* These authors contributed equally.

Abstract

Page layout analysis is a fundamental step of any document image understanding system. We introduce an approach that segments text appearing in page margins (a.k.a. side-notes text) from manuscripts with complex layout format. Simple and discriminative features are extracted at the connected-component level, and robust feature vectors are subsequently generated. A multi-layer perceptron classifier is used to classify connected components into the relevant class of text. A voting scheme is then applied to refine the resulting segmentation and produce the final classification. In contrast to state-of-the-art segmentation approaches, this method is independent of block segmentation as well as pixel-level analysis. The proposed method has been trained and tested on a dataset that contains a variety of complex side-notes layout formats, achieving a segmentation accuracy of about 95%.

1 Introduction

Manually copying a manuscript was the main way to spread knowledge before printing houses were established. Scholars added their own notes on page margins mainly because paper was an expensive material. Historians recognize the importance of the notes' content and the role of their layout; these notes became an important reference by themselves. Hence, analyzing this content became an inevitable step toward reliable manuscript authentication [11], which would subsequently shed light on the manuscript's temporal and geographical origin.

Figure 1. Arabic historical document image with complex layout formatting due to side-notes text.
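The refinement stage mentioned in the abstract (a voting scheme applied after per-component classification) can be illustrated with a minimal sketch. The abstract does not specify how the voting works, so the k-nearest-neighbour majority vote below is an assumption chosen for illustration, not the authors' method. The function name `refine_labels`, the `(x, y, label)` tuple layout, and the labels `"main"` and `"side"` are likewise hypothetical. The classifier itself is not reproduced; the sketch starts from labels that some classifier has already assigned to connected components.

```python
from collections import Counter
from math import hypot


def refine_labels(components, k=3):
    """Relabel each component by majority vote among its k nearest components.

    components: list of (x, y, label) tuples, where (x, y) is the centroid of
    a connected component and label is its initial predicted class.
    Assumption: an odd k with two classes, so the vote cannot tie.
    """
    refined = []
    for xi, yi, _ in components:
        # Sort all components by Euclidean distance to the current centroid;
        # the component itself is at distance 0 and therefore always votes.
        by_dist = sorted(components, key=lambda c: hypot(c[0] - xi, c[1] - yi))
        # Votes are counted on the initial labels, not on already refined ones.
        votes = Counter(label for _, _, label in by_dist[:k])
        refined.append(votes.most_common(1)[0][0])
    return refined


# Hypothetical centroids: a main-text cluster near x = 100..140 and a
# side-note cluster near x = 500..540, each containing one misclassified item.
components = [
    (100, 100, "main"), (110, 100, "main"), (120, 100, "side"),
    (130, 100, "main"), (140, 100, "main"),
    (500, 100, "side"), (510, 100, "side"), (520, 100, "side"),
    (530, 100, "side"), (540, 100, "main"),
]
refined = refine_labels(components, k=3)
```

In this toy input the two isolated mislabels are outvoted by their neighbours, so the first five components end up as `"main"` and the last five as `"side"`. A real system would likely vote over a spatial neighbourhood tuned to the page resolution rather than a fixed k.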
2012 International Conference on Frontiers in Handwriting Recognition. 978-0-7695-4774-9/12 $26.00 © 2012 IEEE. DOI 10.1109/ICFHR.2012.227

The physical structure of handwritten historical manuscripts imposes a variety of challenges for any page layout analysis system. Due to looser formatting rules, non-rectangular layout, and irregularities in the location of layout entities [2, 11], layout analysis of handwritten ancient documents has become a challenging research problem. In contrast to algorithms that cope with modern machine-printed documents or historical documents from the hand-press period, algorithms for handwritten ancient documents are required to cope with the above challenges.

Page layout analysis is a fundamental step of any document image understanding system. The analysis process consists of two main steps: page decomposition and block classification. Page decomposition segments a document image into homogeneous regions,