Journal of Intelligent & Fuzzy Systems 38 (2020) 5579–5588
DOI:10.3233/JIFS-179648
IOS Press
Document summarization using a structural
metrics based representation
Augusto Villa-Monte^a,1, Laura Lanzarini^a, Julieta Corvi^a and Aurelio F. Bariviera^b,∗
a Instituto de Investigación en Informática LIDI (Centro CICPBA), Facultad de Informática, Universidad
Nacional de La Plata, 50 y 120 S/N La Plata, Buenos Aires, Argentina
b Universitat Rovira i Virgili, Department of Business, Av. Universitat 1 Reus, Spain
Abstract. Currently, each person produces 1.7 MB of information every second in different formats. However, the vast majority
of information is text. This has increased the interest in studying techniques to automate the identification of the relevant portions
of text documents in order to offer an automatic summary as a result. This article presents a technique to extract the most
representative sentences of a document taking into account the user’s criteria. These criteria are learned using a neural
network, from a minimum set of documents whose sentences have been rated by the user in terms of importance. To verify
the performance of the proposed methodology, we used 220 scientific articles from the PLOS Medicine journal published
between 2004 and 2016. The results obtained have been very satisfactory.
Keywords: Text summarization, extractive summaries, sentence scoring, feature selection, neural networks
1. Introduction
At present, data generation has become a natural
output of human activity. Due to technological
progress, many everyday actions are recorded. For this
reason, efficient access to and use of such data have
become fundamentally necessary in all areas, in order
to construct pieces of information.
Today, there are many areas interested in extracting
knowledge from stored information, and even more
so in the case of unstructured information. Although
data is continuously generated in many formats, most
of the digital information is stored in text format. For
example, each email sent, each search made on the
Internet or each publication generates, to a greater or
lesser extent, textual data.
In general, text is stored in the form of unstructured
digital documents, using a very different organization
from that of traditional databases. Scientific literature
is characterized by producing an enormous amount
of text documents that cover all the thematic areas of
human knowledge.
1 Post-Doctoral Fellow at National University of La Plata.
∗ Corresponding author. Aurelio F. Bariviera, Universitat
Rovira i Virgili, Department of Business, Av. Universitat 1 43204
Reus, Spain. E-mail: aurelio.fernandez@urv.cat.
The use of automatic tools to summarize text
is essential, since it could facilitate access to
pertinent and available information. When the volume
of information is immense, manually separating
what is essential is a difficult and time-consuming
task. The development of computational solutions
to summarize text constitutes one of the current
lines of research aiming at reducing the prob-
lems generated by the excessive growth of textual
information.
Obtaining computer summaries reduces the enor-
mous volume of unstructured information to its core
content, in order to facilitate its manipulation. By
reading an automatic summary, the user could get
access to the main content of a document in less
time than it would have taken to read the original
document.
Although the definition of summary may vary from
one author to another, we could fairly agree to define
ISSN 1064-1246/20/$35.00 © 2020 – IOS Press and the authors. All rights reserved