Document Image Analysis via Model Checking

Marco Aiello
Institute for Logic, Language and Computation, and
Intelligent Sensory and Information Systems
University of Amsterdam
Plantage Muidergracht 24
1018 TV Amsterdam, The Netherlands
aiellom@ieee.org

1 Introduction

When Dave placed his own drawing in front of the ‘eye’ of HAL in 2001: A Space Odyssey, HAL showed that it had correctly comprehended and interpreted the sketch: “That’s Dr. Hunter, isn’t it?” [9]. But what would have happened if Dave had held the first page of a newspaper in front of the eye and started discussing its contents? Considering HAL to be a system capable of AI, we expect HAL to recognize the document as a newspaper, to understand how to extract information from it, and to understand its contents. Finally, we expect Dave and HAL to begin a conversation on the contents of the document.

Here we present a methodology based on model checking, which has been successfully tested on a heterogeneous collection of documents [1, 11], to extract the content from images of documents. We focus on mechanically generated documents, in contrast with handwriting and sketches. Using terms better known to the image processing community, we are interested in logical structure detection in the context of document image analysis.

Document image analysis is the set of techniques involved in recovering syntactic and semantic information from images of documents, most prominently scanned versions of paper documents. An excellent survey of document image analysis is provided in [8] where, by going through 99 articles that appeared in the IEEE Transactions on Pattern Analysis and Machine Intelligence, Nagy reconstructs the history and state of the art of the field. Research in document image analysis is useful for, and studied in connection with, document reproduction, digital libraries, information retrieval, office automation, and text-to-speech.

There are two distinct tasks in document image analysis.
The first has a syntactic goal: the identification of the basic components of the document, the so-called document objects. The second has a semantic goal: the identification of the role and meaning of the document objects in order to achieve an interpretation of the whole original document. The syntactic information is summarized in the layout structure of the document, while the semantic information goes under the name of logical structure. In the latter task, two subtasks are usually identified: logical labeling and reading order detection. Logical labeling consists of the assignment to document objects of labels indicating their