Journal of Intelligent Learning Systems and Applications, 2013, 5, 108-114
http://dx.doi.org/10.4236/jilsa.2013.52012 Published Online May 2013 (http://www.scirp.org/journal/jilsa)
Automatic Classification of Unstructured Blog Text
Mita K. Dalal¹, Mukesh A. Zaveri²
¹Information Technology Department, Sarvajanik College of Engineering & Technology, Surat, India;
²Computer Engineering Department, S. V. National Institute of Technology, Surat, India.
Email: parikhmita@gmail.com, mazaveri@coed.svnit.ac.in
Received December 14th, 2012; revised February 16th, 2013; accepted February 24th, 2013
Copyright © 2013 Mita K. Dalal, Mukesh A. Zaveri. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
ABSTRACT
Automatic classification of blog entries is generally treated as a semi-supervised machine learning task, in which the
blog entries are automatically assigned to one of a set of pre-defined classes based on the features extracted from their
textual content. This paper attempts automatic classification of unstructured blog entries by applying pre-processing
steps such as tokenization, stop-word elimination and stemming; statistical techniques for feature set extraction and
feature set enhancement using semantic resources; and modeling using two alternative machine learning models, the
naïve Bayesian model and the artificial neural network (ANN) model. Empirical evaluations indicate that this multi-step
classification approach achieves good overall classification accuracy over unstructured blog text datasets with both
machine learning model alternatives. However, the naïve Bayesian classification model clearly outperforms the ANN-based
classification model when only a smaller feature set is available, which is usually the case when a blog topic is recent
and the number of training datasets available is restricted.
Keywords: Automatic Blog Text Classification; Feature Extraction; Machine Learning Models; Semi-Supervised
Learning
1. Introduction
Automatic classification of blog entries is generally
treated as a semi-supervised machine learning task, in
which the blog entries are automatically assigned to one
of a set of pre-defined classes based on the features ex-
tracted from their textual content. Usually this task in-
volves several subtasks in natural language processing
like tokenization, stop-word removal, stemming and
spell-error correction followed by feature set construc-
tion, modeling using an appropriate machine learning
technique and finally, classification using the trained
model.
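The subtasks listed above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation evaluated in this paper: the stop-word list, the suffix-stripping `crude_stem` function, the `NaiveBayes` class and the four training sentences are all hypothetical, spell-error correction is omitted, and a multinomial naïve Bayes model with add-one smoothing is assumed as the classifier.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "was", "of", "to",
              "in", "on", "for", "it", "this", "that"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


class NaiveBayes:
    """Multinomial naive Bayes over stemmed terms, with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(preprocess(doc))
        self.vocab = {t for c in self.term_counts.values() for t in c}
        return self

    def predict(self, doc):
        # Terms never seen during training are ignored.
        tokens = [t for t in preprocess(doc) if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training posts for two pre-defined classes.
train_docs = [
    "The striker scored two goals in the final match",
    "Our team won the cricket match by six wickets",
    "The new processor doubles the laptop battery life",
    "This phone ships with a faster processor and more memory",
]
train_labels = ["sports", "sports", "technology", "technology"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("Which team scored more goals?"))
```

In practice the crude suffix stripper would be replaced by an established stemmer, and the raw term counts by the feature set produced in the feature extraction step.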
Blogging is a popular way of communicating, sharing
information and expressing opinions on the Internet. There are
blogs devoted to sports, politics, technology, education,
movies, finance, etc. Popular blogs have millions of
visitors annually, so they are also important platforms for
mining consumer preferences and for targeted advertising.
Most of the content posted on blogs is textual and
unstructured. Classifying blog text is a challenging task
because blog posts and readers’ comments on them are
usually short, frequently contain grammatical errors and
make use of domain-specific abbreviations and slang
terms which do not match dictionary words. They are
also punctuated inappropriately, which makes tokenization
and parsing using automated tools more difficult. The blog
posts of Internet users are organized in one of three ways
[1]: 1) Pre-classified; 2) Semi-classified; or 3) Un-classified.
These three categories are briefly explained next.
1) Pre-classified—Pre-classified blogs have separate
web-pages allocated to each sub-class, so that the content
posted is automatically sorted. For example, a blog that
posts updates on computer technology could have
pre-allocated pages for categories like “hardware”,
“software”, “outsourcing”, “jobs”, etc.
2) Semi-classified—Semi-classified blogs are those
which have some web-pages pre-classified exclusively
for popular categories, while the rest of the posts appear
as a mixed bag. For example, a sports blog might contain
separate web-pages for popular sports which are often
commented upon, while posts on less popular sports
appear together in a jumble, often simply under the category
“Others”.
3) Un-classified—Un-classified blogs contain no fine-
grained classification and allow all blog postings to ap-