Table Detection via Probability Optimization

Yalin Wang¹, Ihsin T. Phillips², and Robert M. Haralick³

¹ Dept. of Elect. Eng., Univ. of Washington, Seattle, WA 98195, US
ylwang@u.washington.edu
² Dept. of Comp. Science, Queens College, City Univ. of New York, Flushing, NY 11367, US
yun@image.cs.qc.edu
³ The Graduate School, City Univ. of New York, New York, NY 10016, US
haralick@gc.cuny.edu

Abstract. In this paper, we define the table detection problem as a probability optimization problem. We begin, as in our previous algorithm, by finding and validating each detected table candidate. We then compute a set of probability measurements for each of the table entities. The computation of the probability measurements takes into consideration tables, table text separators, and the text blocks neighboring each table. Then, an iterative updating method is used to optimize the page segmentation probability to obtain the final result. This new algorithm shows a great improvement over our previous algorithm. The training and testing data set for the algorithm includes 1,125 document pages with 518 table entities and a total of 10,934 cell entities. Compared with our previous work, it raised the accuracy rates from 90.32% to 95.67% and from 92.04% to 97.05%.

1 Introduction

With the large number of existing documents and the increasing rate at which new documents are produced, finding efficient methods to process these documents for content retrieval and storage becomes critical. Over the last three decades, document image analysis researchers have successfully developed many outstanding methods for character recognition, page segmentation, and understanding of text-based documents. Most of these methods were not designed to handle documents containing complex objects, such as tables. Tables are compact and efficient for presenting relational information, and most of the documents produced today contain various types of tables.
Thus, table structure extraction is an important problem in the document layout analysis field. A few table detection algorithms have been published in the recent literature ([1]–[2]). However, the performance of these reported algorithms is not yet good enough for commercial usage. Among the recently published table detection algorithms, some either use predefined table layout structures [3][4] or rely on complex heuristics for detecting tables ([5], [6], [7]). Klein et al. [7] use a signal model to detect tables. Hu et al. [2] describe an algorithm which detects tables based on computing an optimal partitioning of an input document into some number of tables.

D. Lopresti, J. Hu, and R. Kashi (Eds.): DAS 2002, LNCS 2423, pp. 272–282, 2002. © Springer-Verlag Berlin Heidelberg 2002

They use a dynamic