Appl. Sci. 2022, 12, 6584. https://doi.org/10.3390/app12136584 www.mdpi.com/journal/applsci
Article
Automatic Text Summarization for Hindi Using Real Coded Genetic Algorithm
Arti Jain 1,*, Anuja Arora 1, Jorge Morato 2, Divakar Yadav 3 and Kumar Vimal Kumar 1
1 Department of CSE, Jaypee Institute of Information Technology, Noida 201309, India; anuja.arora29@gmail.com (A.A.); vimalkumar.k@gmail.com (K.V.K.)
2 Computer Science, Universidad Carlos III de Madrid, 28911 Leganes, Spain; jmorato@inf.uc3m.es
3 Department of CSE, NIT Hamirpur, Hamirpur 177005, India; divakar.yadav0@gmail.com
* Correspondence: ajain.jiit@gmail.com; Tel.: +91‐9313519476
Featured Application: This paper demonstrates the applicability of the Real Coded Genetic Algorithm to a Natural Language Processing task, namely Text Summarization. The purpose of text summarization is to reduce an extensive document to a concise format that retains the essence of the content. Users can then use the summarized document in varied applications such as Question Answering, Machine Translation, Fake News Detection, and Named Entity Recognition, to name a few.
Abstract: Automatic Text Summarization (ATS) is in great demand as a way to cope with the ever-growing volume of text data available online and to discover relevant information faster. This research proposes an ATS methodology for the Hindi language that applies a Real Coded Genetic Algorithm (RCGA) to a health corpus available as a Kaggle dataset. The methodology comprises five phases: preprocessing, feature extraction, processing, sentence ranking, and summary generation. Rigorous experimentation is performed on varied feature sets, in which two distinguishing features, namely sentence similarity and named entity features, are combined with others to compute the evaluation metrics. The top 14 feature combinations are evaluated with the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) measure. RCGA computes appropriate feature weights through strings of features, chromosome selection, and the reproduction operators Simulated Binary Crossover and Polynomial Mutation. To extract the highest-scored sentences as the corpus summary, different compression rates are tested. In comparison with existing summarization tools, the ATS extractive method gives a summary reduction of 65%.
Keywords: automatic text summarization; extractive summary; feature set; Hindi language; Hindi
health data; named entity; real coded genetic algorithm; ROUGE metric; summarization tool
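The two reproduction operators named in the abstract can be illustrated with a minimal sketch. It assumes each chromosome is a list of real-valued feature weights bounded to [0, 1]; the function names, the distribution indices `eta`, the mutation probability `pm`, and the bounds are illustrative assumptions and are not taken from the paper.

```python
import random


def sbx_crossover(p1, p2, eta=2.0, rng=random):
    """Simulated Binary Crossover on two equal-length real-valued parents.

    eta is the distribution index (an assumed value): larger values keep
    the children closer to their parents.
    """
    c1, c2 = [], []
    for x1, x2 in zip(p1, p2):
        u = rng.random()  # uniform in [0, 1)
        if u <= 0.5:
            beta = (2.0 * u) ** (1.0 / (eta + 1.0))
        else:
            beta = (1.0 / (2.0 * (1.0 - u))) ** (1.0 / (eta + 1.0))
        # The children are placed symmetrically around the parents' midpoint.
        c1.append(0.5 * ((1.0 + beta) * x1 + (1.0 - beta) * x2))
        c2.append(0.5 * ((1.0 - beta) * x1 + (1.0 + beta) * x2))
    return c1, c2


def polynomial_mutation(x, low=0.0, high=1.0, eta=20.0, pm=0.1, rng=random):
    """Basic polynomial mutation; each gene mutates with probability pm."""
    mutated = []
    for gene in x:
        if rng.random() < pm:
            u = rng.random()
            if u < 0.5:
                delta = (2.0 * u) ** (1.0 / (eta + 1.0)) - 1.0
            else:
                delta = 1.0 - (2.0 * (1.0 - u)) ** (1.0 / (eta + 1.0))
            gene = gene + delta * (high - low)
        # Clip to the assumed weight bounds (this also repairs genes that
        # crossover may have pushed outside the range).
        mutated.append(min(max(gene, low), high))
    return mutated
```

Before clipping, the two children of the crossover always share the midpoint of their parents, which is the property that makes the operator mimic single-point crossover on binary strings.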
1. Introduction
Automatic Text Summarization (ATS) [1,2] is the process of generating a summary that preserves the essence of a text by eliminating irrelevant or redundant content. ATS conveys the vital information in a much shorter version, usually less than half the length of the input text. It remedies the challenge of information overload and helps in information retrieval tasks. ATS provides concise information with reduced redundancy [3] for news articles [4], emails, official government documents, and many more. In general, ATS produces either an extractive summary [5] or an abstractive summary [6]. An extractive summary is generated by selecting essential sentences from the given textual document. The sentences are selected on the basis of the text's statistical parameters and linguistic features and are then combined into the final summary. An abstractive summary, on the other hand, is generated from a deeper understanding of the semantics of the given textual document.
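The extractive approach described above can be sketched as a weighted sentence ranking. This is a minimal illustration, not the paper's pipeline: it assumes every sentence already has a numeric feature vector, that a sentence's score is the weighted sum of its features, and that `keep_ratio` is the fraction of sentences retained. The function name and all parameter names are hypothetical.

```python
import math


def rank_and_select(sentences, features, weights, keep_ratio):
    """Return the top-scoring sentences in their original document order.

    sentences  : list of sentence strings
    features   : one feature vector (list of floats) per sentence
    weights    : one weight per feature (for example, tuned by a GA)
    keep_ratio : assumed fraction of sentences to keep, in (0, 1]
    """
    if len(sentences) != len(features):
        raise ValueError("need exactly one feature vector per sentence")
    # Score each sentence as the weighted sum of its feature values.
    scores = [sum(w * f for w, f in zip(weights, row)) for row in features]
    # Keep at least one sentence, rounding the summary size up.
    k = max(1, math.ceil(keep_ratio * len(sentences)))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    # Restore document order so the summary reads naturally.
    return [sentences[i] for i in sorted(top)]
```

For example, with four sentences, two features per sentence, equal weights, and `keep_ratio=0.5`, the function returns the two highest-scoring sentences in the order in which they appear in the document.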
Citation: Jain, A.; Arora, A.; Morato, J.; Yadav, D.; Kumar, K. V. Automatic Text Summarization for Hindi Using Real Coded Genetic Algorithm. Appl. Sci. 2022, 12, 6584. https://doi.org/10.3390/app12136584
Academic Editors: Julian Szymanski, Higinio Mora, Doina Logofătu and Andrzej Sobecki
Received: 20 April 2022
Accepted: 26 June 2022
Published: 29 June 2022
Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).