Article
Emerging Research Topic Detection Using Filtered-LDA
Fuad Alattar * and Khaled Shaalan
Citation: Alattar, F.; Shaalan, K. Emerging Research Topic Detection Using Filtered-LDA. AI 2021, 2, 578–599. https://doi.org/10.3390/ai2040035
Academic Editor: Amir Mosavi
Received: 12 June 2021; Accepted: 21 October 2021; Published: 31 October 2021
Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Faculty of Engineering and IT, The British University in Dubai, Dubai 345015, United Arab Emirates;
Khaled.shaalan@buid.ac.ae
* Correspondence: fuad.alattar@hotmail.com
Abstract: Comparing two sets of documents to identify new topics is useful in many applications, like
discovering trending topics from sets of scientific papers, emerging topic detection in microblogs, and
interpreting sentiment variations on Twitter. In this paper, the main topic-modeling-based approaches
to address this task are examined to identify limitations and necessary enhancements. To overcome
these limitations, we introduce two separate frameworks to discover emerging topics through a
filtered latent Dirichlet allocation (filtered-LDA) model. The model acts as a filter that identifies old
topics from a timestamped set of documents, removes all documents that focus on old topics, and
keeps documents that discuss new topics. Filtered-LDA also reduces the chance that keywords
from old topics are used to represent emerging topics. The final stage of the filter uses multiple
topic visualization formats to improve human interpretability of the filtered topics, and it presents
the most-representative document for each topic.
Keywords: emerging topic detection; research trend detection; topic discovery; topic modeling; hot
topics; trending topics; FB-LDA; Filtered-LDA
1. Introduction
Finding the right hot scientific topic is a common and challenging task for many students
and researchers. To illustrate, a PhD student must read many scientific papers in a
specific field to identify candidate emerging topics before proposing a dissertation. This
time-consuming exercise usually covers multiple years of publications in order to spot the
evolution of new topics. During the last two decades, multiple techniques have been introduced
to handle similar tasks, wherein large sets of timestamped documents are processed by a
text-mining program to automatically detect emerging topics. Some of these techniques are briefly
described in Sections 2 and 3. In this paper, however, we focus only on those
techniques that employ topic models, which use statistical algorithms to detect the topics that
appear in texts. Topic models treat each piece of data as a word document, and a collection of
word documents forms a corpus. Topics can usually be inferred from similar
words that appear across documents. Each document may therefore consist of multiple
topics, and its dominant topic is the one discussed most within that document.
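The bag-of-words view described above can be sketched in a few lines of Python. Note that the corpus and the document-topic distribution below are purely hypothetical examples for illustration, not the output of any actual model:

```python
from collections import Counter

# Hypothetical example: tokenised documents forming a small corpus.
corpus = [
    "emerging topic detection in research papers".split(),
    "topic models detect topics in text documents".split(),
]

# Bag-of-words representation: each document becomes a multiset of word counts.
bows = [Counter(doc) for doc in corpus]

# A document may mix several topics; its dominant topic is the most probable one.
# The distribution below is a made-up illustration:
doc_topics = {"topic A": 0.62, "topic B": 0.28, "topic C": 0.10}
dominant = max(doc_topics, key=doc_topics.get)
```

Here `dominant` picks out "topic A", the topic with the largest share of the document.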
Some topic models are nonprobabilistic, like latent semantic analysis (LSA) [1]
and non-negative matrix factorization (NMF) [2], whereas other topic models are
probabilistic, like probabilistic latent semantic analysis (PLSA) [3] and latent Dirichlet
allocation (LDA) [4].
LDA is one of the most widely used topic models because of its good performance and its
ability to produce coherent outputs for many applications [5]. LDA represents each document
by a distribution over a fixed number of topics, and each of these topics is represented by a
distribution over words.
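As a concrete illustration of these two distributions, the following self-contained sketch implements a minimal collapsed Gibbs sampler for LDA on a toy corpus. The documents, topic count, and hyperparameter values are illustrative assumptions, not the setup used in this paper:

```python
import random

random.seed(0)

# Toy corpus: each document is a list of tokens (hypothetical example).
docs = [
    "topic model word distribution".split(),
    "topic word topic model".split(),
    "neural network training data".split(),
    "network data model training".split(),
]
vocab = sorted({w for d in docs for w in d})
w2i = {w: i for i, w in enumerate(vocab)}

K, V = 2, len(vocab)          # number of topics, vocabulary size
alpha, beta = 0.1, 0.01       # Dirichlet hyperparameters

# Initialise topic assignments randomly and collect counts.
z = [[random.randrange(K) for _ in d] for d in docs]
ndk = [[0] * K for _ in docs]         # document-topic counts
nkw = [[0] * V for _ in range(K)]     # topic-word counts
nk = [0] * K                          # total tokens per topic
for d, doc in enumerate(docs):
    for n, w in enumerate(doc):
        k = z[d][n]
        ndk[d][k] += 1
        nkw[k][w2i[w]] += 1
        nk[k] += 1

# Collapsed Gibbs sampling: resample each token's topic in turn.
for _ in range(200):
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k, wi = z[d][n], w2i[w]
            ndk[d][k] -= 1; nkw[k][wi] -= 1; nk[k] -= 1
            # Full conditional p(z = k | all other assignments)
            weights = [(ndk[d][j] + alpha) * (nkw[j][wi] + beta) / (nk[j] + V * beta)
                       for j in range(K)]
            k = random.choices(range(K), weights)[0]
            z[d][n] = k
            ndk[d][k] += 1; nkw[k][wi] += 1; nk[k] += 1

# theta: per-document distribution over topics; phi: per-topic distribution over words.
theta = [[(ndk[d][k] + alpha) / (len(docs[d]) + K * alpha) for k in range(K)]
         for d in range(len(docs))]
phi = [[(nkw[k][v] + beta) / (nk[k] + V * beta) for v in range(V)]
       for k in range(K)]
```

Each row of `theta` and `phi` is a proper probability distribution (it sums to one), mirroring the document-topic and topic-word distributions that LDA estimates.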
Figure 1 shows the graphical model of LDA, adapted from [6]. It includes three levels
of representation. The corpus-level representation uses hyperparameters α and β, which are
sampled once when a corpus is generated. The document-level representation's variables