On Extracting Structured Knowledge from Unstructured Business Documents

Gaurav Pandey
Department of Computer Science
University of Minnesota, Twin Cities
Minneapolis, MN, USA
gaurav@cs.umn.edu

Rakshit Daga
Design Services Team
SAP Labs LLC
Palo Alto, CA, USA
rakshit.daga@sap.com

Abstract

Efficient management of text data is a major concern of business organizations. In this direction, we propose a novel approach to extract structured knowledge from large corpora of unstructured business documents. This knowledge is represented in the form of object instances, which are a common way of organizing the available information about entities, and which are modeled here using document templates. The approach is based on the observation that a significant fraction of these documents are created using the cut-copy-paste method, and that it is therefore important to factor this observation into business document analysis projects. Correspondingly, our approach solves the problem of object instance extraction in two steps: similarity search, followed by extraction of object instances from the selected documents. Early qualitative results on a couple of carefully selected document corpora indicate that the approach is effective for solving an important component of the efficient text management problem.

1 Introduction

Text is probably the most common form of knowledge in today's world. Useful information in all domains, be it business or education, gets published as books, webpages, papers or some other form of text. In the domain of business especially, documents are an integral part of the process, since every official detail has to be documented for various purposes, such as sharing and dissemination, keeping proof of decisions made, and standardizing processes. Thus, it is very important for organizations working in this domain to manage the information contained in these documents effectively.
However, despite this importance, an inherent problem with text data is that, in many cases, it is unstructured, i.e., there is no standard format in which information is recorded in a document. This makes it extremely hard for a computer to automatically extract useful knowledge from a document, and thus business organizations have to employ large amounts of specialized human labor for this task. Moreover, in large organizations, where the volume of text data is usually very large, even this labor proves to be insufficient. Hence, there is a great need for an automated system that can extract structured knowledge from unstructured text documents. This paper presents a text mining approach that achieves this goal for a variety of documents.

One useful method to model the structure in a document is via objects. An object is an entity that can be described in terms of individual attributes. For instance, person is an object whose attributes can be name, nationality, profession and income. Instances of this object are persons who have a specified value for each of these attributes. Clearly, once these object instances have been specified, the information so gathered can be stored persistently in an easily accessible and retrievable structure, such as a relational database.

Given this utility of the notion of an object, we hypothesized that many business documents describe instances of different types of objects, such as stock purchases, personnel records, business contracts and others. Thus, the solution proposed in this paper attempts to extract these object instances from documents, so that they can be stored in appropriate storage structures, such as databases, and accessed easily. This task is achieved by modeling an object using a document template and breaking the overall solution into two steps:

1. Similarity search to identify which documents in the given corpus have been created from this template.
2. Extraction of instances from the identified documents.

Another useful insight that assisted the implementation of these steps concerns the method of creation of the documents, i.e., direct versus indirect creation. Briefly, the former refers to the cut-copy-paste method of document creation, while the latter indicates the use of the template as a guide for the preparation of the document. In accordance with this insight, the traditional document similarity measure is modified to include a diff-based similarity, which estimates the extent of direct creation in a document. This similarity is also used in the extraction step to identify the most likely attribute values for object instances in a document. The complete approach is described in detail later.

The rest of the paper is organized as follows. Section 2 discusses related work in the field of structure extraction from text. Section 4 discusses our approach for this problem in detail, and Section 3 explains the necessary background con- 155 AND 2007 155