Document Page Decomposition by the Bounding-Box Projection Technique Jaekyu Ha & Robert M. Haralick Ihsin T. Phillips Dept. of Electrical Engineering, FT-1 University of Washington Seattle, WA 98195 0 Dept. of Computer Science Seattle University Seattle, WA 98122 Abstract This paper describes a method for extracting words, textlines and text blocks by analyzing the spatial con- figuration of bounding boxes of connected components on a given document image. The basic idea is that connected components of black pixels can be used as computational units in document image analysis. In this paper, the problem of extracting words, textlines and text blocks is viewed as a clustering problem in the &dimensional discrete domain. Our main strategy is that profiling analysis is utilized to measure horizontal or vertical gaps of (groups of) components during the process of image segmentation. For this purpose, we compute the smallest rectangular box, called the bound- ing box, which circumscribes a connected component. Those boxes are projected horizontally and/or verti- cally, and local and global projection profiles are an- alyzed for word, textline and text-block segmentation. In the last step of segmentation, the document decom- position hierarchy is produced from these segmented objects. 1 Introduction The printing process is the transformation of the logical hierarchy of a given document into the physi- cal hierarchy. The process must follow the set of rules or protocols which prescribe the physical document layout requirements at the time of production. The requirements may include the font type, size and style for each symbol, the column format (including the number of columns and column width), the header, the footer and margin dimensions. Also, there are also intrinsic spacing protocols for the symbols and words as well as for textlines, text blocks and text columns. In almost all cases, spacings between sym- bols are much smaller than spacings between words within the same printed document. Similarly, spacings between textlines are smaller than spacings between text-blocks and/or text-columns. This tendency has been used as prior knowledge in most OCR and doc- ument image analysis algorithms. This paper describes a technique for extracting words, textlines and text blocks by analyzing the spa- tial configuration of the bounding boxes of symbols in a given document page. In particular, the ‘bounding boxes’ of the connected-components of black pixels are used as the basis of such extractions. The remainder of this Ipaper is organized as follows: In Section 2, we describe the decomposition algorithm in a step-by-step manner. Section 3 discusses experi- ments on the UW English Document Image Database I. The concluding remarks are given in Section 4. 2 Text Zone Delineation Now we describe the page decomposition algorithm in a step-by-step manner. We assume that the input document image has been correctly deskewed. 2.1 Bounding Boxes of Connected Com- ponents Let I denote the input binary image. A connected component analysis algorithm [2] is applied to the fore- ground region of I to produce the set of connected components. Then, for each connected component, its associated bounding box - the smallest rectangu- lar box which circumscribes the component, - is calcu- lated. A bounding box can be represented by giving the coordinates of the upper left and the lower right corners of the box. Figure l(a) shows a segment of an English docu- ment image (taken from the UW English Document Image Database I, page id “L006SYN.TIF”) and Fig- ure l(b) shows the bounding boxes produced in this step. Note that, the number of bounding boxes are O-8186-7128-9/95 $4.000 1995IEEE 1119 Proceedings of the Third International Conference on Document Analysis and Recognition (ICDAR '95) 0-8186-7128-9/95 $10.00 © 1995 IEEE