Categorizing Blogger’s Interests Based on Short Snippets of Blog Posts

Jiahui Liu, Larry Birnbaum, Bryan Pardo
Northwestern University
2133 Sheridan Road, Evanston, IL, 60201, USA
j-liu2@northwestern.edu, {birnbaum, pardo}@cs.northwestern.edu

ABSTRACT
Blogs have become an important medium for people to express opinions and share information on the web. Predicting the interests of bloggers can be beneficial for information retrieval and knowledge discovery in the blogosphere. In this paper, we propose a two-layer classification model to categorize the interests of bloggers based on a set of short snippets collected from their blog posts. Experiments were conducted on a list of bloggers collected from blog directories, with their snippets collected from Google Blog Search. The results show that the proposed method is robust to errors in the lower layer and achieves satisfactory performance in categorizing bloggers’ interests.

Categories and Subject Descriptors
I.5.2 [Pattern Recognition]: Design Methodology-Classifier design and evaluation.

General Terms
Algorithms, Experimentation, Performance.

Keywords
Blogger, interests, categorization

1. INTRODUCTION
As an important form of online publishing for common internet users, blogs have emerged as a dynamic and diversified medium for information creation, distribution and accumulation. Bloggers who are interested in certain domains maintain blog sites to publish news, opinions and ideas about those domains. Identifying and categorizing the interests of bloggers can be valuable for information retrieval and knowledge discovery from blogs.

The blog posts published by bloggers provide important clues for predicting their interests. However, direct categorization of all the texts written by a blogger cannot produce accurate predictions. This is mainly due to two reasons. First, blog articles are written in an informal, erratic style.
Bloggers sometimes even invent new words and grammar to express themselves idiosyncratically. Second, bloggers do not confine themselves to one topic [2]. Therefore, the mixture of all the posts by a blogger is a noisy, multi-topic text document that is difficult to classify.

To address these challenges, we propose a two-layer classification model to categorize bloggers’ interests. In the first layer, a text classifier is trained to predict the probability that a blog post belongs to a domain category. Although the classification of individual posts is not perfect, the categorizations of multiple posts by a blogger provide important information for predicting the overall interests of that blogger. In the second layer, we derive features from the set of categorization probabilities of the posts written by a blogger. Those features are used to categorize the interests of the blogger. By incorporating the membership information of blog posts regarding all the categories, the second-layer classifiers are able to learn the topical correlations among these categories.

We experiment with the proposed model using a collection of bloggers compiled from blog directories, with blog post snippets retrieved from Google Blog Search. As we expected, classification of short snippets in the first layer is not very accurate. However, the features derived from the set of probabilities for all the snippets are meaningful and useful for predicting the overall interests of bloggers. Categorization of bloggers’ interests achieves an F1 measure of 0.845, microaveraged over all the categories.

2. THE PROPOSED TECHNIQUE
In this paper, we use short snippets of a blogger’s posts to characterize their interests. Using the snippets eliminates the need to download the full web page. Snippets are also faster to process than full pages, enabling real-time processing, which is especially critical for web applications.
For each blogger $b$, we collect a set of the most recent blog posts written by $b$, denoted $P(b) = \{p_1, p_2, \ldots, p_n\}$. For each post $p_i$, we extract a short snippet $s_i = \mathrm{Snippet}(p_i)$. A snippet consists of the title and the first few sentences of a blog post, containing about 40 words. $S_b = \{s_1, s_2, \ldots, s_n\}$ is the set of snippets collected for blogger $b$. The goal is to categorize the interests of $b$ into one or more classes drawn from a set of classes $C = \{c_1, c_2, \ldots, c_m\}$.

The proposed technique addresses this task with a two-layer classification model. In the first layer, the classifiers produce a probability estimate $p(c_j \mid s_i)$ for each post snippet $s_i$, which is the probability that snippet $s_i$ belongs to category $c_j$. In the second layer, we derive features from the categorization probabilities for all the snippets written by blogger $b$ and use these features to predict the interests of $b$.

2.1 Categorizing Snippets of Blog Posts
To build text classifiers for snippets, we take the stemmed content words of the snippets as features, with stop words removed. For each category $c_j$, we select the 2000 most predictive stemmed words according to Information Gain [6]. To categorize the snippets, we use a support vector machine (SVM) trained with sequential minimal optimization (SMO) [3], which has been shown to be efficient and effective for text classification. The output of the SVM is fit to a sigmoid model to derive a proper probability estimate of membership [4].

Copyright is held by the author/owner(s). CIKM’08, October 26–30, 2008, Napa Valley, California, USA. ACM 978-1-59593-991-3/08/10.
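
As an illustration of two steps described in Section 2.1, the following minimal Python sketch ranks stemmed words by Information Gain for a single category and maps a classifier's decision value to a probability with a sigmoid. It is not the implementation used in the experiments. The function names (`information_gain`, `select_terms`, `platt_probability`), the toy snippets, and the sigmoid parameters `a` and `b` are assumptions made for illustration. The sigmoid form $1 / (1 + \exp(a f + b))$ is a commonly used choice and is assumed here, and the SVM training itself is omitted.

```python
import math


def entropy(pos, total):
    """Binary entropy (in bits) of a set with `pos` positives out of `total`."""
    if total == 0 or pos == 0 or pos == total:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def information_gain(docs, labels, term):
    """Information Gain of the presence or absence of `term` for one category.

    docs:   list of sets of stemmed words, one set per snippet
    labels: list of bools, True if the snippet belongs to category c_j
    """
    n = len(docs)
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    h_prior = entropy(sum(labels), n)
    h_cond = (
        (len(with_term) / n) * entropy(sum(with_term), len(with_term))
        + (len(without_term) / n) * entropy(sum(without_term), len(without_term))
    )
    return h_prior - h_cond


def select_terms(docs, labels, k):
    """Return the k terms with the highest Information Gain (ties broken alphabetically)."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]


def platt_probability(score, a, b):
    """Map a decision value to a probability: 1 / (1 + exp(a * score + b)).

    `a` and `b` would normally be fit on held-out data; here they are
    hypothetical values supplied by the caller.
    """
    return 1.0 / (1.0 + math.exp(a * score + b))


# Hypothetical toy snippets (already stemmed, stop words removed)
# labelled for an assumed "sports" category.
docs = [
    {"game", "team", "score"},
    {"team", "coach"},
    {"recip", "bake"},
    {"bake", "oven", "score"},
]
labels = [True, True, False, False]

top_terms = select_terms(docs, labels, k=2)
prob = platt_probability(1.0, a=-2.0, b=0.0)
```

In this toy data, "team" and "bake" each split the snippets perfectly by category, so they rank above "score", which appears equally often in both classes. The paper keeps the top 2000 words per category rather than the top 2 used here.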