International Conference on Bangla Speech and Language Processing (ICBSLP), 21-22 September, 2018

Reducing feature space and analyzing effects of using non-linear kernels in SVM for Bangla news categorization

Quazi Ishtiaque Mahmud, Computer Science and Engineering, Shahjalal University of Science and Technology, Sylhet, Bangladesh, rafisustcse@gmail.com
Noymul Islam Chowdhury, Computer Science and Engineering, Shahjalal University of Science and Technology, Sylhet, Bangladesh, shhoroth120@gmail.com
Md Masum, Computer Science and Engineering, Shahjalal University of Science and Technology, Sylhet, Bangladesh, masum-cse@sust.edu

Abstract—Text categorization is a trending topic nowadays. In this paper, we analyzed some existing approaches to Bangla document categorization and proposed some modifications to them. Using our modified approach we achieved an accuracy of 92.79%, which is the best accuracy reported so far on the dataset we used, consisting of more than 30,000 documents. In short, we used TF-IDF combined with a term frequency threshold as our feature selection technique and SVM as our classifier. Our approach also greatly reduces the feature space and the computation time.

Index Terms—SVM, TF-IDF, Uni-gram, Feature Matrix, Polynomial kernel, RBF kernel, TF Threshold, DF Threshold

I. INTRODUCTION

Although Bangla is one of the most popular languages in the world, not much work has been done on Bangla document categorization. In our literature review we went through some previous work done by others. The authors of [1] worked on English document categorization. They handled categorization in two phases, hypothesization and confirmation, ran a set of 500 stories through their system, and claimed an average recall rate of 93%. Experimental work on English has also been done by Durgesh K. Srivastava and Lekha Bhambhu [2], who used SVM to separate classes. Ting-Fan Wu and Chih-Jen Lin [3] used probabilistic measures, generating probabilities by taking two classes at a time and classifying with SVM, whereas Thorsten Joachims emphasized identifying the features that are actually useful [4].

The authors of [5] worked on Bangla document categorization, using the Chi-square distribution with a Naive Bayes approach to categorize Bangla documents. The authors of [6] extended the work of [5]: they used a Support Vector Machine with TF-IDF to categorize Bangla documents, designed their model for 12 categories, and claimed 92.57% accuracy on their dataset. The authors of [7] used a Naive Bayes classifier to classify Bangla documents. The authors of [8] also worked on Bangla text categorization, using an N-gram based approach.

II. DATASET ANALYSIS

In all of our experiments we used the corpus in [9]. We chose this corpus so that our research remains consistent with the previous best work, done by Saiful et al., 2017 [6]. It covers 12 categories. The corpus is balanced, meaning it contains almost the same number of documents for every category. The categories are: art, accident, crime, economics, education, entertainment, environment, international, opinion, politics, sports and technology.

III. METHODOLOGY

A. Data Preprocessing

Data preprocessing is one of the most important tasks before classifying our documents.
The following are some of the steps of data preprocessing that we performed:

- Detecting words from the documents.
- Removing symbols that play no role in document categorization, for example ';', ',', ':' etc.
- Stemming the data. Stemming is the process in which a word is cut down to its root word; for example, "সনায়" becomes "সনা". For stemming we used the stemmer developed by Urmi et al., 2016 [10].
- Lastly, removing all pronouns and conjunctions from our dataset.

B. Feature Extraction

For feature extraction we used the TF-IDF score of every uni-gram in our dataset. TF-IDF is calculated using the following formula:

tf-idf(t, d) = tf(t, d) × idf(t)    (1)

where the term frequency tf(t, d) is the number of times a word or term t appears in document d. We then
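As an illustration of the scoring in Eq. (1), the following is a minimal Python sketch of TF-IDF feature extraction over Bangla unigrams. The paper does not specify an implementation; scikit-learn's TfidfVectorizer, the whitespace token pattern and the toy documents below are assumptions, and scikit-learn additionally applies smoothing and L2 normalisation on top of the raw tf(t, d) × idf(t).

    # Minimal sketch of TF-IDF scoring for Bangla unigrams (illustrative only).
    # Assumptions (not taken from the paper): scikit-learn's TfidfVectorizer,
    # whitespace tokenization of already-preprocessed text, and two toy documents.
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Toy documents, assumed to be already stemmed and stripped of punctuation.
    documents = [
        "খেলা দল জয়",        # sports-like snippet
        "নির্বাচন সরকার দল",   # politics-like snippet
    ]

    # token_pattern=r"\S+" keeps every whitespace-separated unigram, so short
    # Bangla tokens are not dropped by the default word pattern.
    vectorizer = TfidfVectorizer(token_pattern=r"\S+")
    tfidf_matrix = vectorizer.fit_transform(documents)  # sparse document-term matrix

    print(vectorizer.get_feature_names_out())  # the unigram vocabulary
    print(tfidf_matrix.toarray())              # TF-IDF score of each unigram per document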