International Conference on Bangla Speech and Language Processing (ICBSLP), 21-22 September 2018
Reducing feature space and analyzing effects of using non-linear
kernels in SVM for Bangla news categorization
Quazi Ishtiaque Mahmud
Computer Science and Engineering
Shahjalal University of Science and Technology
Sylhet, Bangladesh
rafisustcse@gmail.com
Noymul Islam Chowdhury
Computer Science and Engineering
Shahjalal University of Science and Technology
Sylhet, Bangladesh
shhoroth120@gmail.com
Md Masum
Computer Science and Engineering
Shahjalal University of Science and Technology
Sylhet, Bangladesh
masum-cse@sust.edu
Abstract—Text categorization is a trending topic nowadays.
In this paper, we analyze some existing approaches to Bangla
document categorization and propose modifications to them.
Using our modified approach, we achieved an accuracy of
92.79%, the best accuracy so far on the dataset we used, which
consists of more than 30,000 documents. In short, we used
TF-IDF combined with a term frequency threshold as our
feature selection technique and SVM as our classifier. Our
approach also greatly reduces the feature space and
computation time.
Index Terms—SVM, TF-IDF, Uni-gram, Feature Matrix,
Polynomial kernel, RBF kernel, TF Threshold, DF Threshold
I. INTRODUCTION
Although Bangla is one of the most popular languages in
the world, relatively little work has been done on Bangla
document categorization. During our literature review we
studied several previous works. The authors of [1] worked on
English document categorization, handling it in two phases:
hypothesization and confirmation. They ran a set of 500
stories through the system and claimed an average recall of
93%. Experimental work on English was also done by
Durgesh K. Srivastava and Lekha Bhambhu [2], who used
SVM to separate classes. Probabilistic measures were
employed by Ting-Fan Wu and Chih-Jen Lin [3], who
generated class probabilities by taking two classes at a time
and classifying with SVM, while Thorsten Joachims
emphasized selecting features that are actually useful [4].
The authors of [5] worked on Bangla document
categorization, using the Chi-square distribution with a
Naive Bayes approach. The authors of [6] extended the work
of [5]: they used a Support Vector Machine with the TF-IDF
algorithm to categorize Bangla documents and designed their
model for 12 categories. They claimed 92.57% accuracy on
their dataset using TF-IDF with SVM. The authors of [7]
used a Naive Bayes classifier to classify Bangla documents.
The authors
of [8] also worked on Bangla text categorization, using an
N-gram based approach to categorize Bangla documents.
II. DATASET ANALYSIS
In all of our experiments we used the corpus from [9]. We
chose this corpus so that our research remains consistent
with the previous best work, done by Saiful et al., 2017 [6].
It contains documents from 12 categories. The corpus is
balanced, meaning it contains almost the same number of
documents for each category. The categories are: art,
accident, crime, economics, education, entertainment,
environment, international, opinion, politics, sports and
technology.
III. METHODOLOGY
A. Data Preprocessing
Data preprocessing is one of the most important tasks
before classifying documents. The following are the
preprocessing steps we performed (a minimal code sketch is
given after the list):
• Detecting words in the documents.
• Removing symbols that play no role in document
categorization, for example ';', ',' and ':'.
• Stemming the data. Stemming is the process of cutting
a word down to its root form; for example, "সনায়"
becomes "সনা". For stemming we used the stemmer
developed by Urmi et al., 2016 [10].
• Lastly, removing all pronouns and conjunctions from
our dataset.
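Below is a minimal sketch of this preprocessing pipeline in Python. The symbol pattern, the stopword set, and the stem() placeholder are illustrative assumptions; the paper itself relies on the stemmer of Urmi et al., 2016 [10] and its own pronoun/conjunction list.

```python
import re

# Hypothetical pronoun/conjunction list; the paper's actual list is not given.
STOPWORDS = {"এবং", "কিন্তু", "সে"}

def stem(word):
    # Placeholder for the stemmer of Urmi et al., 2016 [10].
    return word

def preprocess(document):
    # 1. Detect words by splitting on whitespace.
    words = document.split()
    # 2. Strip symbols such as ';', ',' and ':' that carry no category signal.
    words = [re.sub(r"[;,:।'\"?!.]", "", w) for w in words]
    # 3. Stem each word to its root form.
    words = [stem(w) for w in words]
    # 4. Drop pronouns, conjunctions and empty tokens.
    return [w for w in words if w and w not in STOPWORDS]
```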
B. Feature Extraction
For feature extraction we used the TF-IDF score of every
uni-gram in our dataset. TF-IDF is calculated using the
following formula:
tf-idf(t, d) = tf(t, d) × idf(t). (1)
where the term frequency tf(t, d) is the number of times a
word or term t appears in document d.
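As a minimal sketch, formula (1) can be computed directly as below. The idf variant used here, the log of the total number of documents over the document frequency of the term, is a common choice and an assumption on our part, since idf(t) is not defined explicitly in this excerpt.

```python
import math

def tf(term, doc_tokens):
    # tf(t, d): number of times term t appears in document d.
    return doc_tokens.count(term)

def idf(term, all_docs):
    # idf(t) = log(N / df(t)); a common formulation, assumed here.
    df = sum(1 for doc in all_docs if term in doc)
    return math.log(len(all_docs) / df)

def tf_idf(term, doc_tokens, all_docs):
    # Equation (1): tf-idf(t, d) = tf(t, d) * idf(t).
    return tf(term, doc_tokens) * idf(term, all_docs)

# Toy usage with placeholder tokens:
docs = [["খেলা", "দল", "খেলা"], ["রাজনীতি", "দল"]]
print(tf_idf("খেলা", docs[0], docs))  # tf = 2, idf = log(2/1)
```

We then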