An Efficient Segmentation Scheme for the Recognition of Printed Bangla Characters

S.M. Milky Mahmud, Nazib Shahrier, A.S.M Delowar Hossain, Md. Tareque Mohmud Chowdhury, Md. Abdus Sattar
Department of Computer Science and Information Technology
Islamic University of Technology (IUT), Gazipur
E-mail: milky_iut@hotmail.com, nazib_shahrier@yahoo.com, masattar@iut-dhaka.edu, tareque1974@yahoo.com

Abstract
This paper focuses on the segmentation of printed Bangla characters for efficient recognition of the characters. Bangla is one of the most popular scripts in the world. Character segmentation is an important step in character recognition because it allows the system to classify the characters more accurately and quickly. In the case of Bangla, the problem is more difficult because the script contains about 300 basic, modified and compound character shapes. The characters are topologically connected, and Bangla is an inflectional language. Because of these complexities, no efficient segmentation system exists so far. Our proposal overcomes these problems: we apply a two-phase approach to the common problems related to the segmentation of printed Bangla characters.

Keywords
Segmentation, text digitization, skew detection and correction, depth-first search (DFS).

1. INTRODUCTION
We are concerned here with the recognition of Bangla, the second most popular script and language in the Indian subcontinent. About 200 million people of Eastern India and Bangladesh use this language, making it the fourth most popular in the world. This paper focuses on the segmentation of printed Bangla characters so that they can be properly identified and classified. In Bangla, the number of characters is large, and two or more characters combine to form new character shapes called compound or conjunct characters.
This requires segmentation and classification of about 300 characters. The segmentation of these complex and compound characters is a cumbersome task, so segmentation of Bangla script is more difficult than that of any European script. In this paper, we describe some properties of Bangla characters and the complexities they introduce in the process of segmentation. Text digitization and skew correction techniques are also taken into account. Finally, we propose some new techniques to overcome the problems of word and character segmentation.

2. TEXT DIGITIZATION AND SKEW CORRECTION
First, the document is scanned and a digital image is formed. Any noise that is present is then cleaned. The skew correction technique follows the approach described by B.B. Chaudhuri and U. Pal [3]. The procedures followed in our system for proper segmentation are given below.

2.1 Text digitization and noise cleaning
Text digitization can be performed with either a flat-bed scanner or a hand-held scanner. Hand-held scanners typically have a low resolution range, and the scanner has to be moved very slowly over the page in order to scan the image properly. Because of these difficulties, we used a flat-bed scanner, a Hewlett-Packard psc 1200 series model. The digitized images are in gray tone, and we used a histogram-based threshold approach to convert them into two-tone images. For a clear document the histogram shows two prominent peaks corresponding to the white and black regions. The threshold value is chosen as the midpoint of the two histogram peaks. The two-tone image is converted into ‘0’ and ‘1’ labels, where ‘1’ and ‘0’ represent object and background respectively.
The digitized image shows protrusions and dents in the characters, as well as isolated black pixels over the background; these are cleaned by a logical smoothing approach.

2.2 Skew detection and correction
The scanner must be used with care. Casual or improper use of the scanner may introduce skew in the document image. The skew angle is the angle that the text lines of the document image make with the horizontal direction. Skew correction can be achieved in two steps: first, we determine the skew angle Ө, and second, we rotate the image by Ө in the opposite direction. Here, we propose an approach based on the observation of the headline of Bangla script. If a digitized document is skewed, then the headline also satisfies the properties of a skewed digital straight line (DSL) which, if detected, gives an estimate of the skew angle. This approach is accurate, robust and computationally efficient. An advantage of this