Semisupervised Wrapper Choice and Generation for Print-Oriented Documents

Alberto Bartoli, Giorgio Davanzo, Eric Medvet, and Enrico Sorio

Abstract—Information extraction from printed documents is still a crucial problem in many interorganizational workflows. Solutions for other application domains, for example, the web, do not fit this peculiar scenario well, as printed documents do not carry any explicit structural or syntactical description. Moreover, printed documents usually lack any explicit indication about their source. We present a system, which we call PATO, for extracting predefined items from printed documents in a dynamic multisource scenario. PATO selects the source-specific wrapper required by each document, determines whether no suitable wrapper exists, and generates one when necessary. PATO assumes that the need for new source-specific wrappers is a part of normal system operation: new wrappers are generated online based on a few point-and-click operations performed by a human operator on a GUI. The role of operators is an integral part of the design and PATO may be configured to accommodate a broad range of automation levels. We show that PATO exhibits very good performance on a challenging data set composed of more than 600 printed documents drawn from three different application domains: invoices, datasheets of electronic components, and patents. We also perform an extensive analysis of the crucial tradeoff between accuracy and automation level.

Index Terms—Document management, administrative data processing, business process automation, retrieval models, human-computer interaction, data entry

1 INTRODUCTION

Despite the huge advances and widespread diffusion of Information and Communication Technology, manual data entry is still an essential ingredient of many interorganizational workflows.
In many practical cases, the glue between different organizations is provided by human operators who extract the desired information from printed documents and insert that information in another document or application.

As a motivating example, consider an invoice processing workflow: each firm generates invoices with its own firm-specific template and it is up to the receiver to find the desired items on each invoice, for example, invoice number, date, total, VAT amount. Automating workflows of this kind would involve template-specific extraction rules (i.e., wrappers) along with the ability to

1. select the specific wrapper to be used for each document being processed (wrapper choice),
2. figure out whether no suitable wrapper exists, and
3. generate new wrappers when necessary (wrapper generation).

This last operation should be done promptly and possibly with only one document with a given template, as it may not be known if and when further documents with that template will arrive. Existing approaches to information extraction do not satisfy these requirements completely, as clarified below in more detail.

In this paper, we propose the design, implementation, and experimental evaluation of a system with all these features. Our system, which we call PATO, extracts predefined items from printed documents, i.e., either files obtained by scanning physical paper sheets, or files generated by a computer program and ready to be sent to a printer. PATO assumes that the appearance of new templates is not an exceptional event but a part of normal operation.

Wrapper generation has received considerable attention from the research community in recent years, in particular in the context of information extraction from web sources [1], [2], [3], [4], [5], [6], [7]. Wrapper-based approaches fit this scenario very well as they may exploit the syntactic structure of HTML documents.
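The three capabilities listed earlier in this section (wrapper choice, detecting that no suitable wrapper exists, and wrapper generation) can be illustrated with a minimal Python sketch of the control flow only. Everything in it is an assumption made for illustration and is not PATO's actual method: the `Wrapper` class, the `choose_wrapper` and `process` functions, the score threshold, the toy keyword-based matching in `make_keyword_wrapper`, and the `generate` callback that stands in for the human operator.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Wrapper:
    """Hypothetical template-specific extraction rule."""
    template_id: str
    match_score: Callable[[str], float]       # how well a document fits this template
    extract: Callable[[str], Dict[str, str]]  # items extracted from a document


def choose_wrapper(doc: str, wrappers: List[Wrapper],
                   threshold: float) -> Optional[Wrapper]:
    """Return the best-scoring wrapper, or None if no wrapper is suitable."""
    if not wrappers:
        return None
    best = max(wrappers, key=lambda w: w.match_score(doc))
    return best if best.match_score(doc) >= threshold else None


def process(doc: str, wrappers: List[Wrapper], threshold: float,
            generate: Callable[[str], Wrapper]) -> Dict[str, str]:
    """Extract items from doc, generating a new wrapper when none fits."""
    wrapper = choose_wrapper(doc, wrappers, threshold)
    if wrapper is None:
        # Stand-in for the operator-assisted generation step (assumption).
        wrapper = generate(doc)
        wrappers.append(wrapper)
    return wrapper.extract(doc)


def make_keyword_wrapper(template_id: str, keyword: str) -> Wrapper:
    """Toy wrapper: a document matches if it contains the keyword."""
    return Wrapper(
        template_id=template_id,
        match_score=lambda doc: 1.0 if keyword in doc else 0.0,
        extract=lambda doc: {"template": template_id},
    )
```

In this sketch a single document is enough to add a wrapper for a previously unseen template, after which later documents with the same template are handled without calling `generate` again. The threshold is a placeholder for whatever criterion a real system would use to decide that no existing wrapper is suitable.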
In this work, we focus instead on printed documents, which are intrinsically different from webpages for two main reasons. First, printed documents do not embed any syntactical structure: they consist of a flat set of blocks that have only textual and geometrical features (for example, position on the page, block width and height, text content, and so on). Second, the representation of a document obtained from a paper sheet usually includes some noise, both in geometrical and textual features, due to sheet misalignment, OCR conversion errors, staples, stamps, and so on.

PATO addresses wrapper generation based on a maximum-likelihood method applied to textual and geometrical properties of the information items to be extracted [8]. The method is semisupervised in that, when no suitable wrapper for a document exists, PATO shows the document to an operator, who then selects the items to be extracted with point-and-click GUI selections.

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 1, JANUARY 2014

The authors are with the Department of Engineering and Architecture (DIA), University of Trieste, Via Valerio 10, 34127 Trieste, Italy.
Manuscript received 24 May 2012; revised 11 Oct. 2012; accepted 2 Dec. 2012; published online 28 Dec. 2012.
Recommended for acceptance by P. Ipeirotis.
For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number TKDE-2012-05-0366.
Digital Object Identifier no. 10.1109/TKDE.2012.254.
1041-4347/14/$31.00 © 2014 IEEE. Published by the IEEE Computer Society.

There are significant differences between web information extraction and our scenario even in the wrapper choice