Article
Emerging Research Topic Detection Using Filtered-LDA
Fuad Alattar * and Khaled Shaalan
Citation: Alattar, F.; Shaalan, K. Emerging Research Topic Detection Using Filtered-LDA. AI 2021, 2, 578–599. https://doi.org/10.3390/ai2040035
Academic Editor: Amir Mosavi
Received: 12 June 2021; Accepted: 21 October 2021; Published: 31 October 2021
Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Faculty of Engineering and IT, The British University in Dubai, Dubai 345015, United Arab Emirates;
Khaled.shaalan@buid.ac.ae
* Correspondence: fuad.alattar@hotmail.com
Abstract: Comparing two sets of documents to identify new topics is useful in many applications, like
discovering trending topics from sets of scientific papers, emerging topic detection in microblogs, and
interpreting sentiment variations on Twitter. In this paper, the main topic-modeling-based approaches
to address this task are examined to identify limitations and necessary enhancements. To overcome
these limitations, we introduce two separate frameworks to discover emerging topics through a
filtered latent Dirichlet allocation (filtered-LDA) model. The model acts as a filter that identifies old
topics from a timestamped set of documents, removes all documents that focus on old topics, and
keeps documents that discuss new topics. Filtered-LDA also reduces the chance that keywords
from old topics are used to represent emerging topics. The final stage of the filter uses multiple
topic visualization formats to improve human interpretability of the filtered topics, and it presents
the most-representative document for each topic.
Keywords: emerging topic detection; research trend detection; topic discovery; topic modeling; hot
topics; trending topics; FB-LDA; Filtered-LDA
1. Introduction
Finding the right hot scientific topic is a common and challenging task for many students
and researchers. To illustrate, a PhD student must read many scientific papers in a
specific field to identify candidate emerging topics before proposing a dissertation. This
time-consuming exercise usually covers multiple years of publications in order to spot the
evolution of new topics. During the last two decades, multiple techniques have been introduced
to handle similar tasks, wherein large sets of timestamped documents are processed by a
text-mining program to automatically detect emerging topics. Some of these techniques are briefly
described in Sections 2 and 3. In this paper, however, we focus only on those
techniques that employ topic models, which use statistical algorithms to detect the topics that
appear in texts. Topic models treat each piece of data as a word document, and a collection of
word documents forms a corpus. Topics can usually be inferred from similar
words that appear across documents. Each document may therefore consist of multiple
topics, and its dominant topic is the one discussed most within that document.
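The bag-of-words view described above can be sketched in a few lines of Python. Note that the corpus and the document-topic distribution below are purely hypothetical examples for illustration, not the output of any actual model:

```python
from collections import Counter

# Hypothetical example: tokenised documents forming a small corpus.
corpus = [
    "emerging topic detection in research papers".split(),
    "topic models detect topics in text documents".split(),
]

# Bag-of-words representation: each document becomes a multiset of word counts.
bows = [Counter(doc) for doc in corpus]

# A document may mix several topics; its dominant topic is the most probable one.
# The distribution below is a made-up illustration:
doc_topics = {"topic A": 0.62, "topic B": 0.28, "topic C": 0.10}
dominant = max(doc_topics, key=doc_topics.get)
```

Here `dominant` picks out "topic A", the topic with the largest share of the document.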
Some topic models are nonprobabilistic, like latent semantic analysis (LSA) [1]
and non-negative matrix factorization (NMF) [2], whereas other topic models are
probabilistic, like probabilistic latent semantic analysis (PLSA) [3] and latent Dirichlet
allocation (LDA) [4].
LDA is one of the most widely used topic models because of its good performance and its
ability to produce coherent outputs for many applications [5]. LDA represents each document
by a distribution over a fixed number of topics, and each of these topics is represented by a
distribution over words.
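As a concrete illustration of these two distributions, the following self-contained sketch implements a minimal collapsed Gibbs sampler for LDA on a toy corpus. The documents, topic count, and hyperparameter values are illustrative assumptions, not the setup used in this paper:

```python
import random

random.seed(0)

# Toy corpus: each document is a list of tokens (hypothetical example).
docs = [
    "topic model word distribution".split(),
    "topic word topic model".split(),
    "neural network training data".split(),
    "network data model training".split(),
]
vocab = sorted({w for d in docs for w in d})
w2i = {w: i for i, w in enumerate(vocab)}

K, V = 2, len(vocab)          # number of topics, vocabulary size
alpha, beta = 0.1, 0.01       # Dirichlet hyperparameters

# Initialise topic assignments randomly and collect counts.
z = [[random.randrange(K) for _ in d] for d in docs]
ndk = [[0] * K for _ in docs]         # document-topic counts
nkw = [[0] * V for _ in range(K)]     # topic-word counts
nk = [0] * K                          # total tokens per topic
for d, doc in enumerate(docs):
    for n, w in enumerate(doc):
        k = z[d][n]
        ndk[d][k] += 1
        nkw[k][w2i[w]] += 1
        nk[k] += 1

# Collapsed Gibbs sampling: resample each token's topic in turn.
for _ in range(200):
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k, wi = z[d][n], w2i[w]
            ndk[d][k] -= 1; nkw[k][wi] -= 1; nk[k] -= 1
            # Full conditional p(z = k | all other assignments)
            weights = [(ndk[d][j] + alpha) * (nkw[j][wi] + beta) / (nk[j] + V * beta)
                       for j in range(K)]
            k = random.choices(range(K), weights)[0]
            z[d][n] = k
            ndk[d][k] += 1; nkw[k][wi] += 1; nk[k] += 1

# theta: per-document distribution over topics; phi: per-topic distribution over words.
theta = [[(ndk[d][k] + alpha) / (len(docs[d]) + K * alpha) for k in range(K)]
         for d in range(len(docs))]
phi = [[(nkw[k][v] + beta) / (nk[k] + V * beta) for v in range(V)]
       for k in range(K)]
```

Each row of `theta` and `phi` is a proper probability distribution (it sums to one), mirroring the document-topic and topic-word distributions that LDA estimates.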
Figure 1 shows the graphical model of LDA, adapted from [6]. It includes three levels
of representation. The corpus-level representation uses hyperparameters α and β, which are
sampled once when a corpus is generated. The document-level representation's variables