An Approach towards Benchmarking of Table Structure Recognition Results

Thomas Kieninger, German Research Center for AI (DFKI)
Andreas Dengel, German Research Center for AI (DFKI), CS Dept., University of Kaiserslautern, Germany

Abstract

After developing a model-free table recognition system, we wanted to tune its parameters in order to optimize the recognition performance. We therefore developed a benchmarking environment, including a user frontend to acquire ground truth and mechanisms to evaluate the quality of the recognition results. The tasks involved in the analysis system were the location of table regions, the identification of cells, and the mapping of cells to rows and columns.

This paper presents our approach to comparing recognition results with the ground truth. The established definitions of recall and precision did not meet our requirements, as we wanted to register even the smallest improvements (or changes in general) in the results, even when both results were imperfect. We therefore extended the measures recall and precision to deal with recognition probabilities of objects rather than just with boolean values.

1. Introduction

In recent years, more and more researchers have been addressing the topic of table analysis. This area has a wide variety of facets, and hence the early publications on the topic did not always describe comparable technologies. For instance, only a fraction of them were capable of locating a table within a document image and recognizing the cells, rows, and columns, while others concentrated on higher-level tasks such as the understanding or interpretation of the contents. A recent survey of table recognition approaches, indicating the many facets and categorizing the different approaches, is given in [1].

Even once a small set of tasks common to two or more approaches had been identified, it was still not possible to compare them against each other, as no reference corpora or commonly agreed benchmarking procedures had yet been established.

Unlike for document analysis technologies that apply to non-tabular text (zoning, logical labeling, text categorization, content extraction, etc.), there have been few or no efforts towards a comparative evaluation of table recognizers. The reasons for this are manifold: one might be that only a small number of different table recognition approaches have been developed or published so far. Moreover, these approaches are not always comparable among themselves: they either rely on different layout features (how can two systems be compared if one relies on table lines while the other relies on known column headers?) or they do not share the same input/output quality. But an identical level of I/O data is a prerequisite for any comparison.

Benchmarking is a typical activity when comparisons or quantitative statements about the quality of some analysis task are required. While for common Information Retrieval (IR) tasks, e.g. classification, a multitude of ground truth data and established measures [2] already exist, the field of table analysis is still under development. But even given sufficiently large document collections, table ground-truthing has some problematic aspects, as stated in [3].
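As a first intuition for the extended recall and precision measures mentioned in the abstract, the following minimal Python sketch shows how both measures generalize once every object carries a match score in [0, 1] rather than a boolean hit/miss flag. The function names and the simple averaging of per-object scores used here are purely illustrative assumptions, not the exact definitions developed in this paper:

    # Illustrative sketch only: the names and the per-object score model
    # are assumptions; the paper's precise definitions may differ.

    def soft_recall(gt_scores):
        """Recall generalized to graded matches: mean of the match
        scores (in [0, 1]) assigned to the ground-truth objects."""
        return sum(gt_scores) / len(gt_scores)

    def soft_precision(rec_scores):
        """Precision generalized to graded matches: mean of the match
        scores (in [0, 1]) assigned to the recognized objects."""
        return sum(rec_scores) / len(rec_scores)

    # Example: three ground-truth cells, two recognized cells; the first
    # recognized cell matches its ground-truth counterpart only partially.
    gt_scores = [0.8, 1.0, 0.0]   # one score per ground-truth cell
    rec_scores = [0.8, 1.0]       # one score per recognized cell

    print(soft_recall(gt_scores))      # 0.6
    print(soft_precision(rec_scores))  # 0.9

With boolean 0/1 scores, both functions reduce to the classic definitions of recall and precision, so a graded scoring scheme refines, rather than replaces, the established measures.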
The benefits of established benchmarking methods are manifold. First, it is possible to compare alternative approaches in a competitive way in order to find the best approach for a specific class of problems and/or a specific domain of documents. Second, return-on-investment considerations rely heavily on benchmarking results in order to obtain measures to count on, a very important aspect when a technology becomes mature and is about to become part of a product. Last but not least, the evolution and/or tuning of a specific approach with respect to its parameter settings can benefit strongly, as different versions and parameter sets can be compared to each other automatically. Thus, benchmarking takes over the role of a supervisor or teacher while parameter sets are optimized.

Benchmarking itself is characterized by several sub-tasks: At first, one needs some test-collections which