On-Line LDA: Adaptive Topic Models for Mining Text Streams with Applications to Topic Detection and Tracking

Loulwah AlSumait, Daniel Barbará, Carlotta Domeniconi
Department of Computer Science
George Mason University
Fairfax, VA, USA
lalsumai@gmu.edu, dbarbara@gmu.edu, carlotta@cs.gmu.edu

Abstract

This paper presents Online Topic Model (OLDA), a topic model that automatically captures the thematic patterns and identifies emerging topics of text streams and their changes over time. Our approach allows the topic modeling framework, specifically the Latent Dirichlet Allocation (LDA) model, to work in an online fashion such that it incrementally builds an up-to-date model (mixture of topics per document and mixture of words per topic) when a new document (or a set of documents) appears. A solution based on the Empirical Bayes method is proposed. The idea is to incrementally update the current model according to the information inferred from the new stream of data, with no need to access previous data. The dynamics of the proposed approach also provide an efficient means to track the topics over time and detect the emerging topics in real time. Our method is evaluated both qualitatively and quantitatively using benchmark datasets. In our experiments, OLDA discovered interesting patterns by analyzing just a fraction of the data at a time. Our tests also demonstrate the ability of OLDA to align the topics across epochs, which captures the evolution of the topics over time. OLDA is also comparable to, and sometimes better than, the original LDA in predicting the likelihood of unseen documents.

1 Introduction

As electronic documents become available in streams over time, their content contains a strong temporal ordering. Considering the time information is essential to better understand the underlying topics and track their evolution and spread within their domain.
In addition, instead of analyzing large collections of time-stamped text documents as archives in an off-line fashion, it is more practical for genuine applications to analyze, summarize, and categorize the stream of text data at the time of its arrival. For example, as news arrives in streams, organizing it into threads of relevant articles is more efficient and convenient. In addition, there is great potential to rely on automated systems to track current topics of interest and identify emerging trends in online digital libraries and scientific literature. Identifying these emerging topics is essential for selecting and establishing attractive state-of-the-art research projects and business ventures.

Probabilistic topic modeling is a relatively new approach that is being successfully applied to explore and predict the underlying structure of discrete data, such as text. A topic model, such as the Probabilistic Latent Semantic Indexing (PLSI) proposed by Hofmann [9], is a statistical generative model that relates documents and words through latent variables which represent the topics [14]. By considering a document as a mixture of topics, the model is able to generate the words in a document given the small set of latent variables (or topics). Inverting this process, i.e., fitting the generative model to the observed data (words in documents), corresponds to inferring the latent variables and, hence, learning the distributions of the underlying topics.

Latent Dirichlet Allocation (LDA) [2] extends this generative model so that the topic distributions generalize and the model can also be used to generate unseen documents. LDA considers the topics to be multinomial distributions over the words, and assumes the documents to be sampled from random mixtures of these topics.
To complete its generative process for the documents, LDA considers Dirichlet priors for the document distributions over topics and the topic distributions over words.

This paper presents an online version of LDA that automatically captures the thematic patterns and identifies topics of text streams and their changes over time. Our approach allows the LDA model to work in an online fashion such