Distributed LDA based Topic Modeling and Topic Agglomeration in a Latent Space

Gopi Chand Nutakki, Knowledge Discovery & Web Mining Lab, University of Louisville, g0nuta01@louisville.edu
Olfa Nasraoui, Knowledge Discovery & Web Mining Lab, University of Louisville, olfa.nasraoui@louisville.edu
Behnoush Abdollahi, Knowledge Discovery & Web Mining Lab, University of Louisville, b0abdo03@louisville.edu
Mahsa Badami, Knowledge Discovery & Web Mining Lab, University of Louisville, m0bada01@louisville.edu
Wenlong Sun, Knowledge Discovery & Web Mining Lab, University of Louisville, w0sun005@louisville.edu

ABSTRACT
We describe the methodology that we followed to automatically extract topics corresponding to known events provided by the SNOW 2014 challenge in the context of the SocialSensor project. A data crawling tool and selected filtering terms were provided to all the teams. The crawled data was to be divided into 96 (15-minute) timeslots spanning a 24-hour period, and participants were asked to produce a fixed number of topics for the selected timeslots. Our preliminary results are obtained using a methodology that draws on the strengths of several machine learning techniques, including Latent Dirichlet Allocation (LDA) for topic modeling and Non-negative Matrix Factorization (NMF) for automated hashtag annotation and for mapping the topics into a latent space where they become less fragmented and can be related to one another more easily. In addition, we obtain improved topic quality when sentiment detection is performed to partition the tweets based on polarity, prior to topic modeling.

Keywords
Topic Modeling, LDA, NMF, Social Media Mining

1. INTRODUCTION
The SNOW 2014 challenge was organized within the context of the SocialSensor project¹, which works on developing a new framework for enabling real-time multimedia indexing and search in the Social Web.
The aim of the challenge was to automatically extract topics corresponding to known events that were prescribed by the challenge organizers. Also provided was a data crawling tool along with several Twitter filter terms (syria, ukraine, bitcoin, terror). The crawled data was to be divided into a total of 96 (15-minute) timeslots spanning a 24-hour period, with the goal of extracting a fixed number of topics in each timeslot. Only tweets up to the end of the timeslot could be used to extract any topic. In this paper, we focus on the topic extraction task, rather than the presentation of the associated headline, tweets and image URL, because this was one of the activities closest to the ongoing research [2, 8, 7] on multi-domain data stream clustering in the Knowledge Discovery & Web Mining Lab at the University of Louisville.

¹ SocialSensor: http://www.socialsensor.eu/

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SNOW '14, April 7, 2014, Seoul, Korea. Copyright 2014 ACM ...$15.00.

Figure 1: Topic Modeling Framework (sentiment detection and hashtag annotation are not shown).

To extract topics from the tweets crawled in each timeslot, we use a Latent Dirichlet Allocation (LDA) based technique. We then discover latent concepts using Non-negative Matrix Factorization (NMF) on the resulting topics, and apply hierarchical clustering within the resulting Latent Space (LS) in order to agglomerate these topics into less fragmented themes that can facilitate the visual inspection of how the different topics are inter-related.
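The agglomeration step described above can be illustrated with a small, self-contained sketch. It is not the implementation used in this work: the LDA step is skipped and replaced by a made-up 4-topic by 4-term weight matrix, NMF is computed with the basic multiplicative update rules, and the hierarchical clustering uses average linkage with Euclidean distance. The function names (`nmf`, `agglomerate`), the toy data, and all parameter values are assumptions chosen only for illustration.

```python
import random


def matmul(A, B):
    """Plain-Python matrix product of two lists of lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    return [list(r) for r in zip(*A)]


def nmf(V, k, iters=300, seed=0, eps=1e-9):
    """Approximate V (topics x terms) by W * H using multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return W, H


def agglomerate(points, n_clusters):
    """Average-linkage agglomerative clustering; returns one label per point."""
    clusters = [[i] for i in range(len(points))]

    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    def linkage(c1, c2):
        total = sum(dist(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the two clusters with the smallest average pairwise distance.
        pairs = [(linkage(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        _, a, b = min(pairs)
        clusters[a] += clusters[b]
        del clusters[b]

    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy stand-in for LDA output: rows are topics, columns are term weights.
# Topics 0 and 1 share terms, as do topics 2 and 3 (made-up values).
topic_term = [
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 0.0],
    [0.0, 0.0, 5.0, 4.0],
    [0.0, 0.0, 4.0, 5.0],
]

W, H = nmf(topic_term, k=2)          # W: topics in a 2-dimensional latent space
labels = agglomerate(W, n_clusters=2)  # group topics that are close in that space
print(labels)
```

Each row of `W` is a topic's coordinates in the latent space, so topics that use overlapping vocabulary end up close together and are merged early by the clustering.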
We have also experimented with adding a sentiment detection step prior to topic modeling in order to obtain polarity-sensitive topic discovery, and with automated hashtag annotation to improve the topic extraction.
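The timeslot constraint stated in the challenge description (96 slots of 15 minutes, with only tweets up to the end of a slot usable for that slot) can be sketched as follows. The start time, the `(timestamp, text)` tuple layout, and the helper names `slot_index` and `tweets_up_to` are hypothetical and serve only to make the arithmetic concrete.

```python
from datetime import datetime, timedelta

SLOT = timedelta(minutes=15)
N_SLOTS = 96  # 96 * 15 minutes = 24 hours


def slot_index(ts, start):
    """0-based index of the 15-minute slot containing ts, or None if outside the window."""
    idx = (ts - start) // SLOT
    return idx if 0 <= idx < N_SLOTS else None


def tweets_up_to(tweets, start, slot):
    """Tweets posted before the end of the given slot."""
    cutoff = start + (slot + 1) * SLOT
    return [t for t in tweets if t[0] < cutoff]


# Placeholder start time and tweets (made-up values for illustration).
start = datetime(2014, 1, 1, 0, 0)
tweets = [
    (start + timedelta(minutes=5), "tweet a"),
    (start + timedelta(minutes=20), "tweet b"),
    (start + timedelta(hours=23, minutes=59), "tweet c"),
]

for ts, text in tweets:
    print(text, "falls in slot", slot_index(ts, start))
print(len(tweets_up_to(tweets, start, 1)), "tweets usable for slot 1")
```

Topic extraction for slot `s` would then run only on `tweets_up_to(tweets, start, s)`, so no tweet posted after the end of that slot can influence its topics.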