Journal of Intelligent Information Systems, 14, 175–198, 2000
© 2000 Kluwer Academic Publishers. Printed in The Netherlands.

Machine Learning for Intelligent Processing of Printed Documents

FLORIANA ESPOSITO esposito@di.uniba.it
DONATO MALERBA malerba@di.uniba.it
FRANCESCA A. LISI lisi@di.uniba.it
Dipartimento di Informatica, Università degli Studi di Bari, via Orabona 4, 70125 Bari, Italy

Abstract. A paper document processing system is an information system component which transforms information on printed or handwritten documents into a computer-revisable form. In intelligent systems for paper document processing this information capture process is based on knowledge of the specific layout and logical structures of the documents. This article proposes the application of machine learning techniques to acquire the specific knowledge required by an intelligent document processing system, named WISDOM++, that manages printed documents, such as letters and journals. Knowledge is represented by means of decision trees and first-order rules automatically generated from a set of training documents. In particular, an incremental decision tree learning system is applied for the acquisition of decision trees used for the classification of segmented blocks, while a first-order learning system is applied for the induction of rules used for the layout-based classification and understanding of documents. Issues concerning the incremental induction of decision trees and the handling of both numeric and symbolic data in first-order rule learning are discussed, and the validity of the proposed solutions is empirically evaluated by processing a set of real printed documents.

Keywords: learning and knowledge discovery, intelligent information systems, intelligent document processing, decision-tree learning, first-order rule induction

1. Introduction

One of the key issues regarding the use of information systems is the acquisition of new information, which often resides in paper documents. In order to provide a suitable solution to this problem, information systems will have to be integrated with paper document processing systems, which are devised to transform printed or handwritten documents into a computer-revisable form.

Since the 1960’s, much research on paper document processing has focused on optical character recognition (OCR). In the last decade, it has been widely recognized that text acquisition by means of OCR is only one step of document processing, which also includes the separation of text from graphics, the classification of documents, the identification (or semantic labelling) of some relevant components of the page layout and the transformation of the document into an electronic format.

In the literature, the process of breaking down the bitmap of a scanned paper document (document image) into several layout components is called document analysis, while the process of attaching semantic (or logic) labels to some layout components is named document understanding (Tang et al., 1994). Furthermore, the term document classification has been introduced to identify the process of attaching a semantic label (a class name) to the whole document (see figure 1) (Esposito et al., 1990).