ADAPTIVE GENERIC CLASSIFIER FOR STRUCTURED DOCUMENTS

Hamam Mokayed and Azlinah Hj. Mohamed
Universiti Teknologi Mara (UiTM), Faculty of Information Technology and Quantitative Sciences
40450 Shah Alam, Selangor Darul Ehsan, Malaysia
Homammo@gmail.com, Azlinah@tmsk.uitm.edu.my

ABSTRACT
Structured documents such as forms, cheques, and slips are used widely in all sectors and have an inherently high error rate (ERR). The main causes are inconsistency in how people fill in the documents by hand, the different written languages used around the world to enter the required information, and the different structure and layout of each document. In document classification systems, keeping the ERR low is difficult, and finding features that differentiate nearly identical documents is a further challenge. Finding a generic solution for forms in different written languages while overcoming these obstacles is a major challenge in developing a more robust structured document classification system. In this paper, an adaptive generic document classification engine is proposed. It builds a unique sequence of discrete symbols from the structured document’s features and applies a dynamic time warping (DTW) algorithm to calculate the similarity between the symbol sequence of the tested document and the saved symbol sequences of all the templates, and then provides the decision. This novel technique of building a sequence of symbols from unique features and classifying the input with a DTW algorithm shows a higher level of robustness with an improved ERR.

KEY WORDS
Document classification, Feature Extraction, DTW.

1. Introduction
Document classification is needed in many areas. Chief among them are the banking sector, universities, and government offices.
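The matching step outlined in the abstract, comparing the symbol sequence of a tested document against the stored sequence of every template with DTW, can be illustrated with a minimal sketch. The names `dtw_distance` and `classify`, the 0/1 symbol cost, and the example template sequences are assumptions made for illustration; the paper does not specify its symbol alphabet or local cost function here.

```python
def dtw_distance(a, b):
    """Classic DTW distance between two non-empty symbol sequences.

    Assumption: the local cost is 0 for equal symbols and 1 otherwise.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # d[i][j] = cost of the best warping path aligning a[:i] with b[:j]
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            # extend the cheapest of: step in a only, step in b only, step in both
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def classify(sequence, templates):
    """Return the name of the template whose sequence is closest under DTW."""
    return min(templates, key=lambda name: dtw_distance(sequence, templates[name]))


# Hypothetical template sequences, for illustration only
templates = {"cheque": "ABC", "form": "XYZ"}
label = classify("AABCC", templates)
```

Because DTW lets one symbol align with a run of repeated symbols, a sequence such as "AABCC" is at distance 0 from the template "ABC", which is the kind of tolerance to stretching that motivates using DTW rather than a position-by-position comparison.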
Among the different techniques for classifying documents [1-3], layout-based structured document classification is not widely used. One major drawback is its high error rate (ERR), which stems from the difficulty of defining a similarity measure in real situations: the tested document might be tilted [4], corrupted by noise [5], or manually edited using different schemes [6]. Another major problem in structured document classification is that many documents from the same sector may share a standard layout. To overcome these problems, this study proposes an adaptive generic document classification engine that orders and classifies incoming documents by template, regardless of the written language, and directs them to specific departments, people, or another automated system for processing. In this work we develop a structured document classifier that applies DTW to the features extracted from the reference lines and distinctive blobs. A dynamic tilting technique based on clustering has been proposed [7] as a preprocessing stage before feature extraction and classification. To ensure higher accuracy and a lower ERR, the system applies an adaptive thresholding method to binarize the image before the rest of the process starts. The block diagram is shown in Fig. 1.

2. System Modules
The proposed system consists of the following seven modules:

2.1. Data acquisition module
It is difficult to give results comparable with those of other researchers and commercial software suppliers, because there is no uniform benchmark and most forms and cheques are covered by confidentiality clauses. For this reason, we collected different forms and cheques ourselves to obtain a reliable data set for evaluation and testing purposes.

2.2. Pre-processing module
The pre-processing part of the system consists of three steps, shown in pink in Fig. 1.

2.2.1. Size normalization stage
To avoid the problems caused by extracting different features from samples of different sizes, a resizing step is recommended as the initial stage of the whole system [8-9]. It resizes the scanned document horizontally and vertically to 500 × 500.

Proceedings of the IASTED International Conference, Artificial Intelligence and Applications (AIA 2014), February 17 - 19, 2014, Innsbruck, Austria. DOI: 10.2316/P.2014.816-005
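The size normalization stage of Section 2.2.1 can be sketched as follows. This is a minimal illustration, assuming the image is a 2D list of pixel values and that nearest-neighbour sampling is acceptable; the paper does not state which interpolation method is used, and the function name `resize_nearest` is hypothetical.

```python
def resize_nearest(image, out_h=500, out_w=500):
    """Resize a 2D list of pixels to out_h x out_w by nearest-neighbour sampling.

    Assumption: 500 x 500 is the target size in pixels, and the image is
    rectangular and non-empty.
    """
    in_h, in_w = len(image), len(image[0])
    return [
        # map each output coordinate back to its source pixel
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


# A tiny 2 x 2 stand-in for a scanned document
scan = [[0, 1],
        [1, 0]]
normalized = resize_nearest(scan)
```

Every input, whatever its original dimensions, comes out with the same 500 × 500 shape, so features extracted later are measured on a common grid. Scaling the two axes independently changes the aspect ratio, which matches the description of resizing "horizontally and vertically" to a fixed size.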