A Study on Detecting Patterns in Twitter Intra-topic User and Message Clustering

Marc Cheong
Faculty of IT, Monash University, Clayton, Australia
marc.cheong@infotech.monash.edu.au

Vincent Lee
Faculty of IT, Monash University, Clayton, Australia
vincent.lee@infotech.monash.edu.au

Abstract: Timely detection of hidden patterns is key to analyzing and estimating the driving determinants for mission-critical decision making. This study applies Cheong and Lee’s “context-aware” content analysis framework to extract latent properties from Twitter messages (tweets). In addition, we incorporate an unsupervised Self-Organizing Feature Map (SOM) as a machine learning-based clustering tool, which has not previously been investigated in the context of opinion mining and sentiment analysis using microblogging. Our experimental results reveal interesting patterns for topics of interest; these patterns are latent and cannot be easily detected from the observed tweets without the aid of machine learning tools.

Keywords: Online documents; Group interaction: analysis of verbal and non-verbal communication; Pattern recognition systems and applications.

I. INTRODUCTION

Twitter [1] is a popular microblogging platform whose sole purpose is to let its users express themselves within 140 characters. It is fast gaining momentum across the world. Originally, users employed it for the benign purpose of sharing information about themselves with friends and family as a form of online ‘presence’ [2], by answering the simple question: What are you doing? Recently, Twitter has evolved from these basic roots into a facilitator for ‘pushing the message across’. Twitter is now used for more serious purposes such as product marketing, political campaigning, citizen journalism, and market research. On the social end of the spectrum, Twitter is used to connect with other people who share the same interests, spread Internet-based phenomena (memes), and communicate with celebrities.
These uses of Twitter make it suitable as a source of Web-based collective intelligence for gathering opinions and information to support effective decision making. Aside from the domain of Twitter message contents (tweets) and chatter, the Twitter user base itself gives us insight into the collective wisdom of microbloggers. In this paper, we use a novel approach to discover user demography, habits, and sentiments when users contribute to popular topics of discussion on Twitter. We directly use the Twitter-supplied user information and message information for tweets that match a specified topic, and attempt to discover common patterns in the user base and their Twitter habits. This allows us to identify niche communities that contribute to a topic, cluster them according to similarities in demography, usage habits, and sentiments, and visualize such clusters. Our research contributes to the knowledge and practice of microblogging; to the best of our knowledge, there is no prior work on discovering the latent properties of Twitter communities ‘within’ certain topics.

II. RELATED WORK

Academic work on Twitter has been limited because Twitter is a relative newcomer to the social media scene. Related work (since 2008) on the dynamics of the Twitter community has focused on user intentions and ‘tweeting’ style (Mischaud [2]; Java et al. [3]). Studies on the emergent properties of Twitter have been conducted by Huberman et al. [4] and Java et al. [3], who mainly cover the social networking patterns exhibited by Twitter users. The conclusions of these papers indicate that Twitter and other such networks are used to fulfill information needs, foster connections with others, and share knowledge. Cheong & Lee [5] have studied the emergent properties of users chatting about ‘trending topics’ (trends), in terms of demographics that closely relate to the specific ‘trending topic’.
They have also proposed a framework for automated extraction and analysis of demographics and usage habits related to any given topic on Twitter [6].

III. METHODOLOGY

This paper applies Cheong & Lee’s framework [6] to detect and cluster user and messaging patterns in three corpora of messages: political activism, world news, and popular technology. The approach is based on their data-collection framework, which uses a modified method from [5] in conjunction with the Kohonen Self-Organizing Map [7] algorithm. This paper builds upon the case studies mentioned in [6] and clarifies certain points not evident in those case studies by evaluating the effectiveness of visual clustering, comparing it to traditional naïve clustering methods, and re-evaluating the accuracy of prediction of banned users (defined in Section III.B.2).

2010 International Conference on Pattern Recognition 1051-4651/10 $26.00 © 2010 IEEE DOI 10.1109/ICPR.2010.765 3117
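
To illustrate the Kohonen Self-Organizing Map named in the methodology, the following is a minimal, generic sketch of SOM training in plain Python. It is not the authors' implementation: the grid size, the linear decay schedules, and the two-dimensional feature vectors (imagined here as hypothetical normalized per-user attributes) are all assumptions made for illustration only.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid coordinate whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda node: sum((w - v) ** 2 for w, v in zip(weights[node], x)),
    )


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rectangular SOM; returns {(row, col): weight_vector}.

    Assumes every feature in `data` is already scaled to [0, 1].
    """
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = max(rows, cols) / 2.0  # initial neighbourhood radius (assumed)
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        samples = data[:]
        rng.shuffle(samples)
        for x in samples:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # radius shrinks, floored at 0.5
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma * sigma))  # Gaussian neighbourhood
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])   # pull node toward the sample
            step += 1
    return weights


if __name__ == "__main__":
    # Hypothetical feature vectors, e.g. (scaled follower count, scaled tweet rate).
    users = [[0.05, 0.10], [0.10, 0.05], [0.90, 0.95], [0.95, 0.90]]
    som = train_som(users, rows=2, cols=2)
    for u in users:
        print(u, "->", best_matching_unit(som, u))
```

After training, each user vector is assigned to its best-matching node, and users that share a node or land on neighbouring nodes can be read as one cluster. This grid layout is what makes the visual inspection of clusters possible.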