A Study on Detecting Patterns in Twitter Intra-topic User and Message Clustering
Marc Cheong
Faculty of IT
Monash University
Clayton, Australia
marc.cheong@infotech.monash.edu.au
Vincent Lee
Faculty of IT
Monash University
Clayton, Australia
vincent.lee@infotech.monash.edu.au
Abstract— Timely detection of hidden patterns is the key for
the analysis and estimation of the determinants that drive
mission-critical decision making. This study applies Cheong and Lee’s
“context-aware” content analysis framework to extract latent
properties from Twitter messages (tweets). In addition, we
incorporate an unsupervised Self-organizing Feature Map
(SOM) as a machine learning-based clustering tool that has not
previously been investigated in the context of opinion mining and
sentiment analysis on microblogging data. Our experimental
results reveal the detection of interesting patterns for topics of
interest which are latent and cannot be easily detected from the
observed tweets without the aid of machine learning tools.
Keywords- Online documents; Group interaction: analysis of
verbal and non-verbal communication; Pattern recognition
systems and applications.
I. INTRODUCTION
Twitter [1] is a popular microblogging platform whose sole
purpose is to let its users express themselves within 140
characters, and it is fast gaining momentum across the world.
Originally, users shared information about themselves with
friends and family as a form of online ‘presence’ [2] by
answering the simple question: What are you doing?
Recently, Twitter has evolved from its basic roots into a
means to ‘push the message across’. It is now used for more
serious purposes such as product marketing, political
campaigning, citizen journalism, and market research. On the
social end of the spectrum, Twitter is used to connect with
other people who share the same interests, spread
Internet-based phenomena (memes), and communicate with
celebrities.
The aforementioned usages of Twitter make it suitable as
a source of Web-based collective intelligence that is useful in
gathering opinions and information for effective decision
making. Aside from the domain of Twitter message contents
(tweets) and chatter, the Twitter user base itself gives us
insight into the collective wisdom of microbloggers.
In this paper, we use a novel approach to discover user
demography, habits, and sentiments when contributing to
popular topics of discussion on Twitter. We directly use the
Twitter-supplied user information and message information
for tweets that match a specified topic and attempt to
discover pattern commonalities in the user base and their
Twitter habits. This allows us to identify niche communities
which contribute to a topic, cluster them according to
similarities in demography, usage habits, and sentiments,
and visualize such clusters.
Our research contributes to the knowledge and practice
of microblogging; to the best of our knowledge, no prior
work has addressed the discovery of the latent properties of
Twitter communities ‘within’ certain topics.
II. RELATED WORK
Academic work on Twitter has been limited because Twitter is
a relative newcomer in the social media scene. Related work
(since 2008) on the dynamics of the Twitter community has
been in the domain of user intentions and ‘tweeting’ style
(Mischaud [2]; Java et al. [3]).
Studies on the emergent properties of Twitter have been
conducted by Huberman et al. [4] and Java et al. [3], who
mainly cover the social networking patterns exhibited by
Twitter users. These papers conclude that Twitter and
similar networks are used to fulfill information needs,
foster connections with others, and share knowledge.
Cheong & Lee [5] have studied the emergent properties
of users chatting about ‘trending topics’ (trends), in terms of
demographics which closely relate to the specific ‘trending
topic’. They have also proposed a framework for automated
extraction and analysis of demographics and usage habits
related to any given topic on Twitter [6].
III. METHODOLOGY
This paper applies Cheong & Lee’s framework [6] to detect
and cluster user and messaging patterns in three corpora of
messages, namely political activism, world news, and popular
technology. The approach combines their data-collection
framework, which uses a modified method from [5], with the
Kohonen Self-Organizing Map [7] algorithm.
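To make the clustering step concrete, the following is a minimal sketch of a Kohonen Self-Organizing Map in plain Python. It is an illustration under stated assumptions rather than the implementation used in this study: the three input features (account age, follower count, tweets per day, each scaled to [0, 1]), the 3x3 map size, and the linear decay schedules are all hypothetical choices made for the example.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return (row, col) of the node whose weight vector is closest to x."""
    best, best_d = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if d < best_d:
                best, best_d = (r, c), d
    return best


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a small rectangular SOM; returns weights[row][col] -> vector."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = max(rows, cols) / 2.0  # initial neighbourhood radius (assumed)
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(data)  # shuffle a copy so the caller's data is untouched
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # radius shrinks, floored at 0.5
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                    w = weights[r][c]
                    for i in range(dim):
                        # Pull the node toward x, weighted by grid distance to the BMU.
                        w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights


# Hypothetical user feature vectors, scaled to [0, 1]:
# [account age, follower count, tweets per day]
samples = [
    [0.10, 0.05, 0.15], [0.12, 0.10, 0.08], [0.08, 0.12, 0.10],
    [0.90, 0.85, 0.95], [0.88, 0.92, 0.90], [0.95, 0.90, 0.87],
]
som = train_som(samples, rows=3, cols=3)
cells = [best_matching_unit(som, x) for x in samples]
```

Each entry of `cells` is the map cell a user falls into; users mapped to the same or neighbouring cells can be read as one candidate cluster, which is the property a visual SOM inspection relies on.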
This paper builds upon the case studies mentioned in [6]
and clarifies certain points not evident in those case studies,
by evaluating the effectiveness of visual clustering,
comparing it to traditional naïve clustering methods, and re-
evaluating the accuracy of prediction of banned users
(defined in Section III.B.2).
2010 International Conference on Pattern Recognition
1051-4651/10 $26.00 © 2010 IEEE
DOI 10.1109/ICPR.2010.765
3117