International Journal of Computer Applications (0975 – 8887) Volume 71 – No. 1, June 2013

A Perceptive Method for Arabic Word Segmentation using Bounding Boxes by Morphological Dilation

Firoj Parwej, Ph.D
Department of Computer Science, Jazan University, Jazan, Kingdom of Saudi Arabia (KSA)

ABSTRACT
Character recognition is an application of image processing. The character samples are stored digitally in a suitable image format. Each image sample is preprocessed and segmented, and the features defining writer invariants are extracted. Word segmentation is one of the major components of document image analysis. It provides crucial information for skew correction, zone segmentation, and character recognition. Word segmentation is an operation that seeks to decompose an image of a sequence of words into sub-images of individual symbols. Its decision that a pattern isolated from the image is a word can be right or wrong, and it is wrong often enough to make a major contribution to the error rate of the system. In this paper, we present an Arabic word segmentation method for document images. We use bounding boxes to enclose the characters of the Arabic words, and the resulting letter spaces are then progressively filled to merge the character bounding boxes into Arabic word bounding boxes. The proposed technique completely avoids the line segmentation step that normally precedes word segmentation in conventional methods. We tested the proposed method on Arabic-script documents and obtained encouraging results.

Keywords
Arabic Text Line Segmentation, Connected Component Labeling, Morphological Dilation, Arabic Scripts, Document Analysis, Preprocessing.

1. INTRODUCTION
Word segmentation is one of the major components of document image analysis.
Although word line segmentation [1] for machine-printed [2] or hand-printed documents is usually seen as a solved problem, freestyle handwritten word lines still present a significant challenge. For example, handwritten Arabic word lines are curvilinear. A linear or piecewise linear approximation is not accurate in general, and neighboring handwritten Arabic word lines may be close to or touch each other [3]. No well-defined baselines exist for most handwritten documents, so segmenting Arabic word lines in handwritten documents remains a significant challenge. The word lines in a document must be split properly before recognition. The correctness of the word line split directly affects the accuracy of word and character segmentation and consequently [4] the accuracy of word and character recognition.

Fig 1: The Arabic word line segmentation using bounding boxes

It is not trivial to use bounding boxes to label overlapping curvilinear Arabic word lines, as shown in Figure 1. A closed curve represents the boundary of an Arabic word line more suitably. Previous word line segmentation methods, such as connected-component-based methods, work directly on the input image. In connected component analysis, each pixel is treated equally, and a change in a single pixel may produce a different result. If two neighboring Arabic word lines touch each other through even a single handwritten stroke, the segmentation algorithm fails. This problem can be addressed with a probabilistic approach. The distribution of black pixels in a document is not uniform. At the center of an Arabic word line, a pixel is more likely to be black than a pixel in the gap between word lines. Such an approach therefore estimates a likelihood map, in which each pixel value represents the likelihood that the pixel belongs to an Arabic word line.
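
The connected-component and bounding-box ideas discussed above can be illustrated with a minimal sketch. It is not the implementation used in this paper. It assumes a binary image stored as nested lists (1 = ink), 8-connectivity, and a purely horizontal structuring element whose radius is a hypothetical parameter chosen to be wider than the letter spaces but narrower than the word spaces. The function names `dilate_horizontal` and `component_boxes` are illustrative only.

```python
from collections import deque

def dilate_horizontal(img, radius):
    """Binary dilation with a 1 x (2*radius + 1) horizontal structuring element."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                # Spread each ink pixel left and right, clipped to the image.
                for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                    out[y][nx] = 1
    return out

def component_boxes(img):
    """Bounding boxes (top, left, bottom, right), inclusive, of the
    8-connected components of a binary image, in row-major discovery order."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if img[y][x] and not seen[y][x]:
                seen[y][x] = True
                queue = deque([(y, x)])
                top, left, bottom, right = y, x, y, x
                while queue:  # breadth-first flood fill of one component
                    cy, cx = queue.popleft()
                    top, bottom = min(top, cy), max(bottom, cy)
                    left, right = min(left, cx), max(right, cx)
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = cy + dy, cx + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and img[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                boxes.append((top, left, bottom, right))
    return boxes

# Toy image (assumed data): two "words", each made of two glyphs that are
# one blank column apart; the words are four blank columns apart.
page = [
    [1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1],
]

char_boxes = component_boxes(page)                        # one box per glyph
word_boxes = component_boxes(dilate_horizontal(page, 1))  # glyphs merged into words
```

On this toy image the four glyph boxes merge into two word boxes once the one-column letter spaces are filled, while the four-column word space stays open. The word boxes are measured on the dilated image, so they can be slightly wider than the original ink; a fuller version might take the union of the character boxes inside each dilated component instead.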