Fine-grained Document Genre Classification using First Order Random Graphs

Andrew D. Bagdanov, Marcel Worring
andrew@science.uva.nl, worring@science.uva.nl
Intelligent Sensory Information Systems, University of Amsterdam
Kruislaan 403, 1098 SJ Amsterdam, The Netherlands

Abstract

We approach the general problem of classifying machine-printed documents into genres. Layout is a critical factor in recognizing fine-grained genres, as document content features are similar. Document genre is determined from the layout structure detected from scanned binary images of the document pages, using no OCR results and minimal a priori knowledge of document logical structures. Our method uses attributed relational graphs (ARGs) to represent the layout structure of document instances, and first order random graphs (FORGs) to represent document genres. In this paper we develop our FORG-based genre classification method and present a comparative evaluation between our technique and a variety of statistical pattern classifiers. FORGs are capable of modeling common layout structure within a document genre and are shown to significantly outperform traditional pattern classification techniques when fine-grained genre distinctions must be drawn.

1 Introduction

The general unrestricted problem of document understanding is extremely difficult. One cause of this difficulty is the wide diversity of document types, or genres, within the domain of document processing systems. Even in the restricted domain of machine-printed documents, there exists a daunting variety of document types.

Moreover, the complexity of modern document analysis systems is such that incremental improvements in individual components have little overall effect on the performance of the entire system. Many researchers are turning to model-directed document processing in order to obtain highly accurate solutions to document understanding problems in restricted domains [2].
Researchers have been able to construct and tune models to solve difficult problems in table understanding [2], business letter analysis [5], office mail flow automation [11], and postal automation [9]. These specialized solutions, however, leave the general problem of document understanding unchanged, since the same diversity of documents continues to move through a typical office workflow.

A central problem in document image understanding then becomes the automatic determination of document genre, so that an appropriate model can be selected for further processing. We adopt the following definition of document genre:

Definition 1. A document genre is a category of documents characterized by similarity of expression, style, form, or content.

This definition is necessarily broad, as there are many distinct elements that comprise the genre of a specific document. The elements related to expression and content are intrinsically content-based, while those related to form and style are visual in nature.

There is no single, universal partitioning of the universe of paper documents into a set of disjoint genres. Document genre is intrinsically use-specific. Since knowledge of document genre can guide much of the document understanding process, including OCR, genre classification based on minimal logical information extraction is particularly desirable. We are specifically interested in the visual components of genre (i.e. style and form). We can think of document genre distinctions as being coarse-grained (e.g. business letter from technical article) or fine-grained (e.g. PAMI article from CACM article). The choice of fine- versus coarse-grained genre classification depends on the application.

In most document analysis systems the genre classification component is considered to be an intrinsic part of the logical information extraction phase.
We take the view that genre classification plays the important role of bridging the gap between layout analysis and logical information analysis. This allows the logical information analysis phase of the document understanding process to adapt to the genre of a document being processed.
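
The bridging role described above can be pictured with a minimal Python sketch. Everything in it is an illustrative assumption rather than the method developed in this paper: the `Zone` type, the toy rule inside `classify_genre`, the two analyzer functions, and the genre names are all hypothetical. The sketch only shows the control flow in which a genre decision, made from layout alone, selects the logical analysis model that is applied next.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Zone:
    """A text zone from layout analysis: geometry only, no OCR text.

    Coordinates are assumed to be normalized to [0, 1], with y growing downward.
    """
    x: float
    y: float
    width: float
    height: float


Layout = List[Zone]
LogicalLabels = Dict[str, Zone]


def classify_genre(layout: Layout) -> str:
    """Hypothetical stand-in for a genre classifier (NOT the FORG classifier).

    Toy rule: a zone starting in the right half of the page suggests a
    multi-column technical article; otherwise assume a business letter.
    """
    has_right_column = any(zone.x >= 0.5 for zone in layout)
    return "technical_article" if has_right_column else "business_letter"


def analyze_letter(layout: Layout) -> LogicalLabels:
    """Hypothetical letter model: the topmost zone is taken as the letterhead."""
    return {"letterhead": min(layout, key=lambda zone: zone.y)}


def analyze_article(layout: Layout) -> LogicalLabels:
    """Hypothetical article model: the topmost zone is taken as the title."""
    return {"title": min(layout, key=lambda zone: zone.y)}


# Genre-specific logical analysis models, selected by the genre label.
LOGICAL_ANALYZERS: Dict[str, Callable[[Layout], LogicalLabels]] = {
    "business_letter": analyze_letter,
    "technical_article": analyze_article,
}


def understand(layout: Layout) -> Tuple[str, LogicalLabels]:
    """Layout analysis output -> genre decision -> genre-specific logical analysis."""
    genre = classify_genre(layout)
    return genre, LOGICAL_ANALYZERS[genre](layout)
```

For example, calling `understand` on a layout with two side-by-side body zones routes the page to `analyze_article`, while a single-column layout is routed to `analyze_letter`. Keeping the genre decision separate from the analyzers means a new genre only requires a new entry in `LOGICAL_ANALYZERS`.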