DRAFT

Automatic Topic Labeling using Ontology-based Topic Models

Mehdi Allahyari
Computer Science Department
University of Georgia, Athens, GA
Email: mehdi@uga.edu

Krys Kochut
Computer Science Department
University of Georgia, Athens, GA
Email: kochut@cs.uga.edu

Abstract—Topic models, which frequently represent topics as multinomial distributions over words, have been extensively used for discovering latent topics in text corpora. Topic labeling, which aims to assign meaningful labels to discovered topics, has recently gained significant attention. In this paper, we argue that the quality of topic labeling can be improved by considering ontology concepts rather than words alone, in contrast to previous works in this area, which usually represent topics via groups of words selected from topics. We have created: (1) a topic model that integrates ontological concepts with topic models in a single framework, where each topic is represented as a multinomial distribution over concepts and each concept as a multinomial distribution over words, and (2) a topic labeling method based on the ontological meaning of the concepts included in the discovered topics. In selecting the best topic labels, we rely on the semantic relatedness of the concepts and their ontological classifications. The results of our experiments, conducted on two different data sets, show that introducing concepts as additional, richer features between topics and words, and describing topics in terms of concepts, offers an effective method for generating meaningful labels for the discovered topics.

Keywords—Statistical learning, topic modeling, topic model labeling, DBpedia ontology

I. INTRODUCTION

Topic models such as Latent Dirichlet Allocation (LDA) [1] have recently gained considerable attention. They have been successfully applied to a wide variety of text mining tasks, such as word sense disambiguation [2], sentiment analysis [3], and others, in order to identify hidden topics in text documents.
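A topic learned by LDA is simply a multinomial distribution over the vocabulary, usually inspected through its top-ranked words. The following is a minimal sketch of that inspection step; the vocabulary and probabilities are invented for illustration and are not drawn from the paper's corpus:

```python
import numpy as np

def top_words(phi, vocab, k=7):
    """Return the k highest-probability words of a topic-word distribution phi."""
    order = np.argsort(phi)[::-1]      # indices sorted by descending probability
    return [vocab[i] for i in order[:k]]

# Toy topic-word distribution (hypothetical values).
vocab = ["query", "database", "databases", "queries",
         "processing", "efficient", "relational", "graphics"]
phi = np.array([0.25, 0.20, 0.15, 0.12, 0.10, 0.08, 0.07, 0.03])

print(top_words(phi, vocab, k=3))  # → ['query', 'database', 'databases']
```

Reading off such a ranked word list is exactly the step a human performs when assigning a label like "relational databases" to a topic.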
Topic models typically assume that documents are mixtures of topics, while topics are probability distributions over the vocabulary. When the topic proportions of documents are estimated, they can be used as the themes (high-level semantics) of the documents. Top-ranked words in a topic-word distribution indicate the meaning of the topic. Thus, topic models provide an effective framework for extracting the latent semantics from unstructured text collections. However, even though the topic-word distributions are usually meaningful, it is very challenging for users to accurately interpret the meaning of the topics based only on the word distributions extracted from the corpus, particularly when they are not familiar with the domain of the corpus. For example, Table I shows the top words of a topic learned from a collection of computer science abstracts; the topic has been labeled by a human as "relational databases".

Topic labeling means finding one or a few phrases that sufficiently explain the meaning of the topic. This task, which can be labor intensive, particularly when dealing with hundreds of topics, has recently attracted considerable attention.

TABLE I. EXAMPLE OF A TOPIC WITH ITS LABEL.

Human Label: relational databases
    query
    database
    databases
    queries
    processing
    efficient
    relational

Within the Semantic Web, numerous data sources have been published as ontologies. Many of them are interconnected as Linked Open Data (LOD)¹. For example, DBpedia [4] (as part of LOD) is a publicly available knowledge base extracted from Wikipedia in the form of an ontology of concepts and relationships, making this vast amount of information programmatically accessible on the Web.

Recently, automatic topic labeling has been an area of active research. [5] represented topics as multinomial distributions over n-grams, so that the top n-grams of a topic can be used to label it. Mei et al.
[6] proposed an approach to automatically label topics by converting the labeling problem into an optimization problem: for each topic, a candidate label is chosen that has the minimum Kullback-Leibler (KL) divergence from, and the maximum mutual information with, the topic. In [7], the authors proposed a method for topic labeling based on: (1) generating the label candidate set from the topic's top terms and the titles of Wikipedia pages containing those terms; and (2) scoring and ranking the candidate labels and selecting the top-ranked candidate as the label of the topic. Mao et al. [8] proposed a topic labeling approach that enhances label selection by using the sibling and parent-child relations between topics. In a more recent work, Hulpus et al. [9] addressed topic labeling by relying on the structured data in DBpedia. The main idea is to construct a topic graph of DBpedia concepts corresponding to the topic's top-k words, apply graph-based centrality algorithms to rank the concepts, and then select the most prominent concepts as labels of the topic.

Our principal objective is to incorporate the semantic graph of concepts in an ontology (DBpedia here), together with their various properties, within unsupervised topic models, such as LDA. Our work differs from all previous works in that they focus on the topics learned via the LDA topic model (i.e., topics are multinomial distributions over words). In our model, we introduce another latent variable, called concept, between

¹ http://linkeddata.org/
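The KL-based label selection of Mei et al. [6], described above, can be sketched as follows. All distributions and candidate labels here are hypothetical toy values, and the mutual-information term of [6] is omitted for brevity:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions, smoothed to avoid log(0)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def best_label(topic, candidates):
    """Pick the candidate whose word distribution has minimum KL divergence from the topic."""
    return min(candidates, key=lambda label: kl_divergence(topic, candidates[label]))

# Toy distributions over a 3-word vocabulary (hypothetical values).
topic = [0.5, 0.3, 0.2]
candidates = {
    "relational databases": [0.45, 0.35, 0.20],
    "computer graphics":    [0.10, 0.10, 0.80],
}
print(best_label(topic, candidates))  # → relational databases
```

The candidate whose word distribution most closely matches the topic's word distribution wins; this is the scoring-and-ranking pattern shared by several of the label-selection approaches surveyed above.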