Urdu Text Classification

Abbas Raza Ali, National University of Computers and Emerging Sciences, Block-B, Faisal Town, Lahore, Pakistan, abbas.raza@nu.edu.pk
Maliha Ijaz, National University of Computers and Emerging Sciences, Block-B, Faisal Town, Lahore, Pakistan, maliha.ijaz@nu.edu.pk

ABSTRACT
This paper compares two statistical techniques for text classification, Naïve Bayes and Support Vector Machines, in the context of the Urdu language. A large corpus is used for training and testing the classifiers. Since the classifiers cannot interpret the raw dataset directly, language-specific preprocessing techniques are applied to it to generate a standardized, reduced-feature lexicon. Urdu is a morphologically rich language, which makes these tasks complex. Statistical characteristics of the corpus and lexicon are measured and show satisfactory results for the text preprocessing module. The empirical results show that Support Vector Machines outperform the Naïve Bayes classifier in terms of classification accuracy.

Keywords
Corpus, information retrieval, lexicon, Naïve Bayes, normalization, feature selection, text classification, text mining, Urdu.

1. INTRODUCTION
Text classification is the process of automatically classifying unknown text by suggesting the most probable class to which it belongs. As the volume of electronic information grows day by day, classification has become a key technique for organizing large amounts of data for analysis and processing [9]. Text classification is involved in many applications such as text filtering, document organization, classification of news stories, and searching for interesting information on the web. These are language-specific systems, mostly designed for English; no such work has been done for the Urdu language. Developing classification systems for Urdu documents is therefore a challenging task due to the morphological richness of the language and the scarcity of its resources, such as automatic tools for tokenization, feature selection, and stemming.
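As a rough illustration of the language-specific preprocessing mentioned above, the sketch below shows two of the steps needed to standardize raw Urdu text: diacritics elimination and character normalization. The specific character mappings are illustrative assumptions on our part (common Arabic-script-to-Urdu normalizations), not the authors' exact tables.

```python
# Hypothetical sketch of two Urdu preprocessing steps: diacritics elimination
# and character normalization. The code point sets below are assumptions
# drawn from the Unicode Arabic block, not the paper's actual mapping tables.

# Common Urdu diacritics (optional Arabic combining vowel marks such as
# zabar, zer, pesh, shadda); removing them collapses spelling variants.
URDU_DIACRITICS = {"\u064B", "\u064C", "\u064D", "\u064E", "\u064F",
                   "\u0650", "\u0651", "\u0652", "\u0670"}

# Map Arabic-script variant code points to their canonical Urdu equivalents.
NORMALIZATION_MAP = {
    "\u064A": "\u06CC",  # Arabic yeh        -> Farsi/Urdu yeh
    "\u0643": "\u06A9",  # Arabic kaf        -> keheh
    "\u0629": "\u06C3",  # Arabic teh marbuta -> teh marbuta goal
}

def remove_diacritics(text: str) -> str:
    """Strip optional vowel marks so surface variants map to one form."""
    return "".join(ch for ch in text if ch not in URDU_DIACRITICS)

def normalize(text: str) -> str:
    """Replace Arabic-script variant characters with canonical Urdu ones."""
    return "".join(NORMALIZATION_MAP.get(ch, ch) for ch in text)

def preprocess(text: str) -> str:
    """Apply both steps; later stages (stop words, stemming) would follow."""
    return normalize(remove_diacritics(text))
```

In a full pipeline, the output of this stage would then pass through stop-word elimination, stemming, and term-frequency calculation before reaching the classifiers.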
Two classifiers based on supervised learning techniques are developed and their accuracies are compared on the given dataset. In the experiments, the Naïve Bayes classifier is found to be more efficient than the Support Vector Machines; however, the SVMs outperform Naïve Bayes in terms of classification accuracy. The overall system is divided into three main components:

1) Acquisition, compilation and labeling of the text documents of the corpus
2) Preprocessing of the raw corpus to generate a standardized, reduced-feature lexicon
3) Training of statistical classifiers on the preprocessed data to classify test data

The detailed architecture of the system, along with its three components, is shown in Figure 1.

[Figure 1. Architecture of Urdu text classification system. The preprocessing stage comprises corpus acquisition, lexicon-based tokenization, normalization, diacritics elimination, stop-word elimination, affix-based stemming, and calculation of normalized term frequencies. The Naïve Bayes branch estimates p(Term | class) and p(class) during training and selects maximum p(class | Term) during classification; the SVM branch calculates α, w, and w0 during training and selects maximum[α·r·K(xtest, x) + w0] during classification.]

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
FIT'09, December 16–18, 2009, CIIT, Abbottabad, Pakistan.
Copyright 2009 ACM 978-1-60558-642-7/09/12...$10.
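The Naïve Bayes branch of the architecture (estimate p(Term | class) and p(class) during training, then select the class maximizing p(class | Term)) can be sketched as follows. This is a minimal sketch, assuming documents arrive as lists of already-preprocessed tokens; the Laplace smoothing used here is our assumption, as the paper does not specify a smoothing scheme.

```python
import math
from collections import Counter, defaultdict

# Minimal multinomial Naïve Bayes sketch matching the Figure 1 branch:
# training estimates p(Term | class) and p(class) from labeled token lists;
# classification returns the class with maximum posterior p(class | Term).
# Laplace (add-one) smoothing is an assumption, not stated in the paper.

class NaiveBayes:
    def fit(self, docs, labels):
        self.classes = set(labels)
        self.priors = Counter(labels)            # class counts, for p(class)
        self.term_counts = defaultdict(Counter)  # term counts per class
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.term_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(labels)

    def predict(self, tokens):
        # Maximize log p(class) + sum over terms of log p(term | class).
        best_class, best_score = None, float("-inf")
        v = len(self.vocab)
        for c in self.classes:
            counts = self.term_counts[c]
            n = sum(counts.values())
            score = math.log(self.priors[c] / self.total_docs)
            for t in tokens:
                score += math.log((counts[t] + 1) / (n + v))  # smoothed
            if score > best_score:
                best_class, best_score = c, score
        return best_class
```

Working in log space avoids floating-point underflow when multiplying many small term probabilities, which matters for realistic document lengths.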
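Likewise, the SVM classification step in Figure 1 evaluates the decision value Σᵢ αᵢ·rᵢ·K(xtest, xᵢ) + w0 for each class and picks the maximum. The sketch below assumes a one-vs-rest multiclass setup with precomputed support-vector coefficients; the training that produces α and w0 is omitted, and the linear kernel is our illustrative choice.

```python
# Sketch of the SVM classification step from Figure 1: given support-vector
# coefficients alpha_i, labels r_i in {-1, +1}, support vectors x_i, and a
# bias w0, the decision value is sum_i alpha_i * r_i * K(x_test, x_i) + w0.
# One binary machine per class (one-vs-rest) is an assumption; training of
# the alphas and w0 is not shown.

def linear_kernel(x, y):
    """Illustrative kernel choice: plain dot product."""
    return sum(a * b for a, b in zip(x, y))

def decision_value(alphas, labels, support_vectors, w0, x_test,
                   K=linear_kernel):
    """Compute sum_i alpha_i * r_i * K(x_test, x_i) + w0."""
    return sum(a * r * K(x_test, sv)
               for a, r, sv in zip(alphas, labels, support_vectors)) + w0

def classify(machines, x_test):
    """machines: {class_name: (alphas, labels, support_vectors, w0)};
    return the class whose machine yields the maximum decision value."""
    return max(machines,
               key=lambda c: decision_value(*machines[c], x_test))
```

Only the support vectors (training points with nonzero α) contribute to the sum, which is what keeps kernel-SVM classification tractable on large corpora.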