Int. j. inf. tecnol. (April 2023) 15(4):2187–2195
https://doi.org/10.1007/s41870-023-01268-w
ORIGINAL RESEARCH
An integrated clustering and BERT framework for improved topic modeling
Lijimol George¹ · P. Sumathy¹
Received: 20 September 2022 / Accepted: 11 April 2023 / Published online: 6 May 2023
© The Author(s), under exclusive licence to Bharati Vidyapeeth’s Institute of Computer Applications and Management 2023
Abstract Topic modeling is a machine learning technique that is extensively used in Natural Language Processing (NLP) applications to infer topics within unstructured textual data. Latent Dirichlet Allocation (LDA) is one of the most widely used topic modeling techniques and can automatically detect topics from a huge collection of text documents. However, LDA-based topic models alone do not always provide promising results. Clustering is an effective unsupervised machine learning technique that is extensively used in applications such as extracting information from unstructured textual data and topic modeling. A hybrid model of Bidirectional Encoder Representations from Transformers (BERT) and Latent Dirichlet Allocation (LDA) for topic modeling, combined with clustering based on dimensionality reduction, has been studied in detail. Because clustering algorithms are computationally complex, and this complexity grows with the number of features, PCA-, t-SNE- and UMAP-based dimensionality reduction methods are also applied. Finally, a unified clustering-based framework using BERT and LDA is proposed as part of this study for mining a set of meaningful topics from massive text corpora. Experiments are conducted to demonstrate the effectiveness of the cluster-informed topic modeling framework using BERT and LDA by simulating user input on benchmark datasets. The experimental results show that clustering with dimensionality reduction helps infer more coherent topics, and hence this unified clustering and BERT-LDA based approach can be effectively utilized for building topic modeling applications.
Keywords Latent Dirichlet allocation (LDA) · Topic
modeling · k-means clustering · Dimensionality reduction ·
Bidirectional encoder representations from transformers
(BERT)
1 Introduction
As the volume of online data increases exponentially, it is essential to understand the topics in the available data promptly and correctly. There is an increased demand for automatic topic modeling systems due to the enormous volume of text data available in electronic form today and the limitations of human reading abilities. Multiple techniques are used to discover topics from text, images, and videos [1]. Topic models have been extensively utilized in Topic Recognition and Tracking tasks, which help to track, detect, and designate topics from a stream of documents or texts. This machine learning technique is widely used in NLP applications to analyze unstructured textual data and automatically discover the abstract topics within it. Various methods have been proposed to accurately identify topics [2–4] from huge text corpora.
This study is intended to model various text mining techniques to infer hidden semantic structures, and hence the topics, in a text corpus. Topic modeling experiments are conducted using the CORD-19 dataset. The topic modeling investigations are performed based on sentence embeddings after pre-processing. A hybrid model of Bidirectional Encoder Representations from Transformers (BERT) and Latent Dirichlet Allocation (LDA) for topic modeling has been studied in detail.
* Lijimol George
lijimol.george@tcs.com
P. Sumathy
sumathy.p@bdu.ac.in
¹ Department of Computer Science, Bharathidasan University, Tiruchirappalli 620 023, Tamil Nadu, India