Journal of Intelligent Learning Systems and Applications, 2013, 5, 108-114
http://dx.doi.org/10.4236/jilsa.2013.52012
Published Online May 2013 (http://www.scirp.org/journal/jilsa)

Automatic Classification of Unstructured Blog Text

Mita K. Dalal 1, Mukesh A. Zaveri 2

1 Information Technology Department, Sarvajanik College of Engineering & Technology, Surat, India; 2 Computer Engineering Department, S. V. National Institute of Technology, Surat, India.
Email: parikhmita@gmail.com, mazaveri@coed.svnit.ac.in

Received December 14th, 2012; revised February 16th, 2013; accepted February 24th, 2013

Copyright © 2013 Mita K. Dalal, Mukesh A. Zaveri. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

Automatic classification of blog entries is generally treated as a semi-supervised machine learning task, in which the blog entries are automatically assigned to one of a set of pre-defined classes based on the features extracted from their textual content. This paper classifies unstructured blog entries in several steps: pre-processing (tokenization, stop-word elimination and stemming), statistical feature set extraction, feature set enhancement using semantic resources, and modeling with one of two alternative machine learning models, the naïve Bayesian model or the artificial neural network (ANN) model. Empirical evaluations indicate that this multi-step classification approach achieves good overall classification accuracy on unstructured blog text datasets with both machine learning models.
However, the naïve Bayesian classification model clearly outperforms the ANN-based classification model when only a smaller feature set is available, which is usually the case when a blog topic is recent and the number of available training datasets is limited.

Keywords: Automatic Blog Text Classification; Feature Extraction; Machine Learning Models; Semi-Supervised Learning

1. Introduction

Automatic classification of blog entries is generally treated as a semi-supervised machine learning task, in which the blog entries are automatically assigned to one of a set of pre-defined classes based on the features extracted from their textual content. Usually this task involves several natural language processing subtasks, such as tokenization, stop-word removal, stemming and spell-error correction, followed by feature set construction, modeling using an appropriate machine learning technique and, finally, classification using the trained model.

Blogging is a popular way of communicating, sharing information and expressing opinions on the Internet. There are blogs devoted to sports, politics, technology, education, movies, finance, etc. Popular blogs have millions of visitors annually, so they are also important platforms for mining consumer preferences and for targeted advertisement. Most of the content posted on blogs is textual and unstructured. Classifying blog text is a challenging task because blog posts and readers' comments on them are usually short, frequently contain grammatical errors, and make use of domain-specific abbreviations and slang terms which do not match dictionary words. They are also punctuated inappropriately, which makes tokenization and parsing with automated tools more difficult.

The blog posts of Internet users are organized in one of three ways [1]: 1) Pre-classified; 2) Semi-classified; or 3) Un-classified. These three categories are briefly explained next.
1) Pre-classified: Pre-classified blogs have separate web-pages allocated to each sub-class, so that the content posted is automatically sorted. For example, a blog that posts updates on computer technology could have previously allocated pages for categories like "hardware", "software", "outsourcing", "jobs", etc.

2) Semi-classified: Semi-classified blogs are those which have some web-pages pre-classified exclusively for popular categories, while the rest of the posts appear as a mixed bag. For example, a sports blog might contain separate web-pages for popular sports which are often commented upon, while posts on less popular sports appear as a jumble, often simply grouped under the category "Others".

3) Un-classified: Un-classified blogs contain no fine-grained classification and allow all blog postings to ap-