Author Profiling: Predicting Gender and Age from Blogs, Reviews & Social Media

Satya Sri Yatam
Department of Computer Science & Engineering, Swarnandhra Institute of Engineering & Technology, Seetharampuram, Narsapur, W.G.Dt., A.P.

T Raghunadha Reddy
Assoc. Prof. & Head, Department of Computer Science & Engineering, Swarnandhra Institute of Engineering & Technology, Seetharampuram, Narsapur, W.G.Dt., A.P.

Abstract: Author profiling aims to determine the gender, age, native language, level of education, or socio-economic category of authors by analyzing their published texts. In recent times, researchers have proposed several solutions that focus primarily on predicting the age and gender of authors. In this paper, we propose a machine learning approach to determine an author's age and gender. Our approach uses two types of features: content based and style based. For the content based features, we consider the words used by the authors; for the style based features, we consider the part-of-speech tags of those words. We evaluated our system using the PAN 2014 author profiling dataset on blogs, reviews and social media data.

Keywords: Author Profiling, Classification, Age and Gender Prediction.

I. INTRODUCTION

Author profiling [3] [6] has gained importance due to the enormous impact of social media on our daily lives. Several applications, such as forensics, internet security, and commercial recommendation systems, require information specific to the creator of the content. For example, author profiling can help police identify characteristics of the perpetrator of a crime when there are too few (or too many) specific suspects to consider. Similarly, in online marketing, companies want to make recommendations that accurately suit the interests of the user. For this purpose, they analyze user activity on social forums such as blogs and online product reviews to mine demographic information about people.
In social media, we are mainly interested in everyday language and how it reflects basic social and personality processes. The increasing accessibility of public blogs, reviews and social media offers new ways to harvest information from texts written by hundreds of thousands of different authors. In such scenarios, author profiling can be used to study the sociolect aspect, that is, how language is shared by people. The aim of this work is to contribute to the topic of author profiling by experimenting with two types of features and a popular machine learning classification algorithm, the Support Vector Machine [8]. In this paper, we attempt to exploit blogs, reviews and social media to find the correlation between the profiles of authors and the language styles they use. We believe that the ideas used in this work can help to analyze how everyday language reflects basic social and personality traits. In this work, we consider the popular profiling dimensions: age and gender.

The rest of the paper is organized as follows: Section II introduces the two types of features used in our approach. Section III describes our experimental settings and evaluation metrics. Section IV discusses our results on the PAN 2014 dataset for the author profiling task, and Section V concludes this work.

II. IDENTIFYING THE CHARACTERISTICS OF THE AUTHOR

In this section, we detail the features used in our investigation as well as the classification approach that we adopted. People of different ages write differently because of variations in their topics of interest and in the experience gained over years of practice, which can change aspects of writing style such as word choice and grammar. Writing also varies with gender: for example, females tend to write more about shopping, design and wedding events, while males typically tend to write more about sports, finance, technology and politics. Further, studies [5] have shown that females use more adverbs and adjectives in their writing than males.
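As a minimal sketch of the two kinds of signal described above (topic words and the use of adjectives and adverbs), the following Python snippet counts words and measures the share of adjectives and adverbs in text that has already been part-of-speech tagged. The Penn Treebank tag names, the function names `content_features` and `style_features`, and the hand-tagged toy sentence are assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter

# Penn Treebank tag prefixes (an assumption; the tagset is not named in the text).
ADJECTIVE_PREFIX = "JJ"   # covers JJ, JJR, JJS
ADVERB_PREFIX = "RB"      # covers RB, RBR, RBS


def content_features(tagged_tokens):
    """Count lower-cased words: a simple content based representation."""
    return Counter(word.lower() for word, _ in tagged_tokens)


def style_features(tagged_tokens):
    """Return the fraction of tokens tagged as adjectives and as adverbs."""
    total = len(tagged_tokens)
    if total == 0:
        return {"adjective_ratio": 0.0, "adverb_ratio": 0.0}
    adjectives = sum(1 for _, tag in tagged_tokens if tag.startswith(ADJECTIVE_PREFIX))
    adverbs = sum(1 for _, tag in tagged_tokens if tag.startswith(ADVERB_PREFIX))
    return {"adjective_ratio": adjectives / total, "adverb_ratio": adverbs / total}


# Hypothetical, hand-tagged example; a real system would obtain tags from a POS tagger.
sample = [
    ("The", "DT"), ("match", "NN"), ("was", "VBD"), ("really", "RB"),
    ("exciting", "JJ"), ("and", "CC"), ("the", "DT"), ("final", "JJ"),
    ("innings", "NN"), ("ended", "VBD"), ("quickly", "RB"),
]

words = content_features(sample)
style = style_features(sample)
```

Feature values of this kind could then be passed to a classifier such as the Support Vector Machine used in this work.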
Therefore, it is useful to employ features that can differentiate between the writing styles and content of male and female bloggers of different ages. We considered two types of features that are useful for distinguishing between different categories of authors: content based features and style based features.

CONTENT BASED FEATURES

Content based features are important for distinguishing between male and female writers [7]. For example, a blog related to cricket is more likely to be written by a male author than a female one. A blog about the sport of cricket typically contains words such as cricket, ODI, test match, innings, six, BCCI, world cup, and IPL. Thus, the occurrence of words like cricket and world cup increases the chances of the blog being written by a male rather than a female blogger.

International Journal of Engineering Research & Technology (IJERT), ISSN: 2278-0181, www.ijert.org, IJERTV3IS120479, Vol. 3 Issue 12, December-2014

Similarly, the occurrence of words or phrases like my husband, flowers, pink, boyfriend, etc. will