New metrics for evaluating performance in document analysis tasks - application to the table case

Ana Costa e Silva
University of Edinburgh
Ana.Costa.e.Silva@ed.ac.uk

Abstract

Is an algorithm capable of high precision and recall at classifying lines as part of a table really good at locating tables? Several document analysis tasks require gluing or cutting certain document elements to form others. The suitability of the commonly used precision and recall for such division/aggregation tasks is debatable, since their underlying assumption is that the granularity of the items at input is the same as at output. We propose new evaluation metrics especially suited to this type of task, and show their application to several table tasks. In the process we present robust table location and cell segmentation algorithms.

1. Introduction

Originally developed for information retrieval and later adapted for classification, Precision (P) is the proportion of correctly identified elements among all identified elements, and Recall (R) is the proportion of correctly identified elements among all target elements. Both metrics assume that the granularity of items at output is the same as at input. However, for many tasks this assumption does not hold, and document analysis offers many examples: clipping images from documents, locating objects in images, joining lines into paragraphs, dividing strings into cells or aggregating cells into columns. In this type of task, which we call aggregation/division, given elements are glued or split to form others, so the unit of measurement inherently differs between input and output.
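As a small illustration of the definitions of P and R above, the following sketch computes both at two granularities for one invented case. The page layout, the helper names `precision_recall` and `runs`, and the rule that a table counts as found only when its exact set of lines is recovered are all assumptions made for this example, not part of the method proposed in this paper.

```python
def precision_recall(predicted, target):
    """Return (P, R) for two collections of items of the same granularity."""
    predicted, target = set(predicted), set(target)
    correct = len(predicted & target)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(target) if target else 0.0
    return p, r


def runs(line_ids):
    """Group line numbers into maximal runs of consecutive lines.

    Each run stands for one candidate table (an assumption of this sketch).
    """
    tables, current = [], []
    for i in sorted(line_ids):
        if current and i != current[-1] + 1:
            tables.append(tuple(current))
            current = []
        current.append(i)
    if current:
        tables.append(tuple(current))
    return tables


# Hypothetical ground truth: a single table covering lines 10 to 19.
target_lines = set(range(10, 20))
# Hypothetical classifier output: every table line found except line 14.
predicted_lines = target_lines - {14}

# Input granularity: lines.
line_p, line_r = precision_recall(predicted_lines, target_lines)
# Output granularity: whole tables, matched only if identical.
table_p, table_r = precision_recall(runs(predicted_lines), runs(target_lines))

print(f"lines : P={line_p:.2f} R={line_r:.2f}")
print(f"tables: P={table_p:.2f} R={table_r:.2f}")
```

In this invented case a single missed line leaves line-level P and R at 1.00 and 0.90, while splitting the one target table into two fragments drives table-level P and R to zero under exact matching.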
When using P&R for these tasks, authors must choose between measuring them in relation to input or output units. This forces an unnatural uniformity between input and output granularities, makes the metrics uninformative about the kind of choices the algorithm makes, and can eventually make them misleading about its performance, as we shall see. What is needed is a measure of our ability to generate, from an input granularity, the granularity required at output, i.e. our ability to transform inputs into outputs.

2. Metrics commonly used in table tasks

P&R and their harmonic mean (F-measure) have been extremely common metrics in table tasks: for table location (at line [13], [14], cell [16], [18], full table [2], [3], [16], and column/row [10] levels), table segmentation ([8]), and functional analysis ([7]).

To evaluate location, [4]’s t’eval and [1]’s Table Evaluation Index metrics are positively correlated with the total area that is correctly detected as table and negatively correlated with the total area of non-tables incorrectly detected, their main difference being that t’eval associates costs with different error types. These metrics are interesting but may be unsuitable outside a location context. [12] have also developed metrics for division/aggregation tasks in which the input granularity is pixels. They propose a number of different, possibly complementary, metrics, but these lose the sense of trade-off and the interpretability that P&R, for example, offer, which may be why they are not used more widely outside pixel aggregation/division tasks.

On the other hand, [5] proposed an evaluation model that is resistant to the difficulties in ground-truthing tables that they had identified in a previous paper, [6]. The output of their table analysis method is a graph model of the table. In parallel, a ground truth graph model is manually created. Three classes of questions are posed to both graphs and the percentage of agreement is measured.
The first class aims at evaluating the quality of the segmentation task (how many columns does the table have?); the second evaluates the table’s functional model (how many attributes contain “Open”?); and the third evaluates the structural model (mimicking database-type queries). [18] weigh the mistakes made in each class of questions. [8] calls this “functional” evaluation, in contrast to the more common “absolute” approaches. However, different performances may be reached if different questions are posed, which is an undesirable characteristic.
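The dependence of question-based evaluation on the questions chosen can be seen in a minimal sketch. It does not reproduce the graph model of [5]: tables are simplified here to dictionaries mapping a column name to its cells, and the example tables, the question functions and the name `agreement` are all hypothetical.

```python
def agreement(questions, truth, output):
    """Fraction of questions answered identically by both table models."""
    same = sum(1 for q in questions if q(truth) == q(output))
    return same / len(questions)


# Hypothetical ground-truth table: column name -> list of cell strings.
truth = {
    "Date": ["Mon", "Tue"],
    "Open": ["1.0", "1.2"],
    "Close": ["1.1", "1.3"],
}
# Hypothetical system output in which the last two columns were merged.
output = {
    "Date": ["Mon", "Tue"],
    "Open Close": ["1.0 1.1", "1.2 1.3"],
}


def n_columns(table):
    """Segmentation-type question: how many columns are there?"""
    return len(table)


def n_rows(table):
    """Segmentation-type question: how many rows are there?"""
    return max(len(cells) for cells in table.values())


def n_open_attributes(table):
    """Functional-type question: how many attributes contain 'Open'?"""
    return sum("Open" in name for name in table)


def first_date(table):
    """Query-type question: what is the first value in the Date column?"""
    return table["Date"][0]


# Two hypothetical question sets applied to the same pair of tables.
set_a = [n_columns, n_rows, n_open_attributes, first_date]
set_b = [n_columns]

score_a = agreement(set_a, truth, output)
score_b = agreement(set_b, truth, output)
print(f"question set A: {score_a:.2f}")
print(f"question set B: {score_b:.2f}")
```

With these invented inputs, the same segmentation error scores 0.75 under the first question set and 0.00 under the second, which is the sensitivity to question choice noted above.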