Mining Topics in Documents: Standing on the Shoulders of Big Data

Zhiyuan Chen and Bing Liu
Department of Computer Science, University of Illinois at Chicago
czyuanacm@gmail.com, liub@cs.uic.edu

ABSTRACT

Topic modeling has been widely used to mine topics from documents. However, a key weakness of topic modeling is that it needs a large amount of data (e.g., thousands of documents) to provide reliable statistics to generate coherent topics. In practice, many document collections do not have so many documents. Given a small number of documents, the classic topic model LDA generates very poor topics. Even with a large volume of data, unsupervised learning of topic models can still produce unsatisfactory results. In recent years, knowledge-based topic models have been proposed, which ask human users to provide some prior domain knowledge to guide the model to produce better topics. Our research takes a radically different approach. We propose to learn as humans do, i.e., retaining the results learned in the past and using them to help future learning. When faced with a new task, we first mine some reliable (prior) knowledge from the past learning/modeling results and then use it to guide the model inference to generate more coherent topics. This approach is possible because of the big data readily available on the Web. The proposed algorithm mines two forms of knowledge: must-links (meaning that two words should be in the same topic) and cannot-links (meaning that two words should not be in the same topic). It also deals with two problems of the automatically mined knowledge: wrong knowledge and knowledge transitivity. Experimental results using review documents from 100 product domains show that the proposed approach makes dramatic improvements over state-of-the-art baselines.
Categories and Subject Descriptors

H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing

Keywords

Topic Model; Lifelong Learning; Opinion Aspect Extraction

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. KDD'14, August 24–27, 2014, New York, NY, USA. Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-2956-9/14/08 ...$15.00. http://dx.doi.org/10.1145/2623330.2623622

1. INTRODUCTION

Topic models, such as LDA [4], pLSA [12] and their extensions, have been popularly used for topic extraction from text documents. However, these models typically need a large amount of data, e.g., thousands of documents, to provide reliable statistics for generating coherent topics. This is a major shortcoming because in practice few document collections have so many documents. For example, in the task of finding product features or aspects from online reviews for opinion mining [13, 19], most products do not even have more than 100 reviews (documents) on a review website. As we will see in the experiment section, given 100 reviews, the classic topic model LDA produces very poor results. To deal with this problem, there are three main approaches:

1. Inventing better topic models: This approach may be effective if a large number of documents are available.
However, since topic models perform unsupervised learning, if the data is small, there is simply not enough information to provide reliable statistics to generate coherent topics. Some form of supervision or external information beyond the given documents is necessary.

2. Asking users to provide prior domain knowledge: An obvious form of external information is prior knowledge of the domain from the user. For example, the user can input knowledge in the form of must-links and cannot-links. A must-link states that two terms (or words) should belong to the same topic, e.g., price and cost. A cannot-link indicates that two terms should not be in the same topic, e.g., price and picture. Some existing knowledge-based topic models (e.g., [1, 2, 9, 10, 14, 15, 26, 28]) can exploit such prior domain knowledge to produce better topics. However, asking the user to provide prior domain knowledge can be problematic in practice because the user may not know what knowledge to provide and wants the system to discover it for him or her. It also makes the approach non-automatic.

3. Learning like humans (lifelong learning): We still use the knowledge-based approach but mine the prior knowledge automatically from the results of past learning. This approach works like human learning: we humans always retain the results learned in the past and use them to help future learning. That is why, whenever we see a new situation, few things are really new because we have seen many aspects of it in the past in some other contexts. In machine learning, this paradigm is called lifelong learning [30, 31]. The proposed technique takes this approach. It represents a major step forward as it closes the learning or modeling loop in the sense that the whole process is now fully automatic and can learn or model continuously. However, our approach is very different from existing lifelong learning methods (see Section 2).

Existing research has focused on the first two approaches.
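To make the idea of mining must-links from past results concrete, the following is a minimal sketch (not the paper's actual algorithm): it assumes past topics are available as lists of top words per topic, and treats a word pair as a must-link candidate if the pair co-occurs in the top words of sufficiently many past topics. The function name, data, and the support threshold are illustrative assumptions.

```python
from itertools import combinations
from collections import Counter

def mine_must_links(past_topics, min_support=3):
    """Count how often each word pair co-occurs in the top words of
    past topics; pairs appearing together in at least `min_support`
    topics become must-link candidates (words likely to share a topic)."""
    pair_counts = Counter()
    for topic in past_topics:  # each topic: list of its top words
        for w1, w2 in combinations(sorted(set(topic)), 2):
            pair_counts[(w1, w2)] += 1
    return {pair for pair, n in pair_counts.items() if n >= min_support}

# Toy past topics from several domains (hypothetical data).
past = [
    ["price", "cost", "expensive"],
    ["price", "cost", "cheap"],
    ["price", "cost", "value"],
    ["picture", "photo", "image"],
]
print(mine_must_links(past, min_support=3))  # {('cost', 'price')}
```

A cannot-link candidate could be mined analogously, e.g., a pair of words that each appear frequently in past topics but rarely in the same one; the paper's own method additionally handles wrong knowledge and knowledge transitivity, which this sketch does not.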
We believe it is high time to create algorithms and build systems that learn as humans do. Lifelong learning is possible in our context due to two key observations: