Digital Object Identifier (DOI) 10.1007/s10032-004-0122-7
IJDAR (2004) 6: 167–180

Problem-adaptable document analysis and understanding for high-volume applications

Bertin Klein, Andreas R. Dengel

German Research Center for Artificial Intelligence (DFKI), P.O. Box 2080, 67608 Kaiserslautern, Germany

Received: 26 June 2003 / Accepted: 17 February 2004
Published online: 16 March 2004 – © Springer-Verlag 2004

Abstract. Although the Internet is increasingly emerging as “the” widespread platform for information interchange, day-to-day work in companies still necessitates the laborious, manual processing of huge amounts of printed documents. This article presents smartFIX, a document analysis and understanding system developed by the DFKI spin-off insiders technologies. It enables the automatic processing of documents ranging from fixed-format forms to unstructured letters of any format. In addition to the architecture, main components, and system characteristics, we also show some results from the application of smartFIX to medical bills and prescriptions.

Keywords: Document analysis – Document analysis and understanding – Document classification – Information extraction – Table extraction – Extraction optimization – Extraction verification – Industrial invoice processing

1 Introduction

About 1.2 million printed medical bills arrive at the 35 German private health insurance companies every day. These bills account for 10% of the German health insurance market, and this business is still conducted with printed paper bills. Figure 1 shows examples of such bills. Until recently, these bills were processed almost completely manually. In addition to the tedious task of initiating every single payment by transcribing a number of data fields from varying locations on the paper documents into a computer, this had the serious disadvantage that only a small number of inconsistent and overpriced bills were discovered.
Conservative estimates predict savings in the range of several hundred million euros each year if this process could be automated reliably.

Extension of the version published in Lecture Notes in Computer Science (LNCS), vol. 2423, Springer, Heidelberg, 2002
Correspondence to: B. Klein (e-mail: klein@dfki.de)

About 2 years ago, a consortium of German private health insurance companies ran a public benchmark test of systems for document analysis and understanding (DAU) for the private insurance sector. smartFIX (smart For Information eXtraction) won the benchmark, which demonstrated its suitability. smartFIX was developed by insiders (www.insiders.de, founded in 1999 by the second coauthor), a spin-off company of the DAU group at DFKI, on the premise that, after a decade of focused research, the available DAU technology was ready for constructing a versatile and adaptive DAU system [2,4,5]. Once a viable type of application scenario had been identified, research could be directed at a feasible combination of available and new methods. This background likely explains smartFIX's clear suitability for the project that insiders and four insurance companies undertook after the benchmark: the development of a standard product for the analysis of printed medical bills, smartFIX healthcare.

Two important facts about the health insurance domain are:

1. Bills are more complex than forms.
2. Each individual bill is inspected by a human operator.

Therefore, the DAU task for bills requires more than the simpler methods that suffice for forms, which poses a challenge for the DAU technology developers in the project. At the same time, the insurance auditors can be assured that the economic success of the project does not depend on a breakthrough in the distant future; every successfully implemented DAU step, however small, immediately reduces the human operators' workload.
Every correctly recognized data item saves typing and can be logically and numerically checked. Diagnoses can be automatically coded into ICD-10 (the international standard code for diagnoses). Even with no recognition results, the efficient user interface of the result viewer facilitates the processing of scanned bills.

In general, smartFIX healthcare is not limited to the domain of medical bills but is also applicable to most kinds of forms and many unstructured documents. In the