Int. j. inf. tecnol. (April 2023) 15(4):2187–2195
https://doi.org/10.1007/s41870-023-01268-w

ORIGINAL RESEARCH

An integrated clustering and BERT framework for improved topic modeling

Lijimol George 1 · P. Sumathy 1

Received: 20 September 2022 / Accepted: 11 April 2023 / Published online: 6 May 2023
© The Author(s), under exclusive licence to Bharati Vidyapeeth's Institute of Computer Applications and Management 2023

Abstract  Topic modelling is a machine learning technique that is extensively used in Natural Language Processing (NLP) applications to infer topics within unstructured textual data. Latent Dirichlet Allocation (LDA) is one of the most widely used topic modeling techniques and can automatically detect topics from a huge collection of text documents. However, LDA-based topic models alone do not always provide promising results. Clustering is an effective unsupervised machine learning technique that is extensively used in applications such as extracting information from unstructured textual data and topic modeling. A hybrid model of Bidirectional Encoder Representations from Transformers (BERT) and Latent Dirichlet Allocation (LDA) for topic modeling, combined with clustering based on dimensionality reduction, has been studied in detail. Because clustering algorithms are computationally complex and their complexity increases with the number of features, PCA, t-SNE, and UMAP based dimensionality reduction methods are also applied. Finally, a unified clustering-based framework using BERT and LDA is proposed as part of this study for mining a set of meaningful topics from massive text corpora. Experiments are conducted to demonstrate the effectiveness of the cluster-informed topic modeling framework using BERT and LDA by simulating user input on benchmark datasets.
The experimental results show that clustering with dimensionality reduction helps infer more coherent topics, and hence this unified clustering and BERT-LDA based approach can be effectively utilized for building topic modeling applications.

Keywords  Latent Dirichlet allocation (LDA) · Topic modeling · k-means clustering · Dimensionality reduction · Bidirectional encoder representations from transformers (BERT)

1 Introduction

As the volume of online data increases exponentially, it is essential to understand the topics in the available data promptly and correctly. There is an increased demand for automatic topic modeling systems due to the enormous volume of text data available in electronic form today and the limitations of human reading abilities. Multiple techniques are used to discover topics from text, images, and videos [1]. Topic models have been extensively utilized in Topic Recognition and Tracking tasks, which help to track, detect, and designate topics from a stream of documents or texts. This machine learning technique is widely used in NLP applications to analyze unstructured textual data and automatically discover the abstract topics within it. Various methods have been proposed to accurately identify topics [24] from huge text corpora.

This study is intended to model various text mining techniques to infer hidden semantic structures, and hence the topics, in a text corpus. Topic modeling experiments are conducted using the CORD-19 dataset. The topic modeling investigations are performed based on sentence embedding after pre-processing. A hybrid model of Bidirectional Encoder Representations from Transformers (BERT) and Latent Dirichlet Allocation (LDA) for topic modeling has been studied in detail.

* Lijimol George
  lijimol.george@tcs.com
  P. Sumathy
  sumathy.p@bdu.ac.in

1 Department of Computer Science, Bharathidasan University, Tiruchirappalli 620 023, Tamil Nadu, India