Journal of Intelligent & Fuzzy Systems 38 (2020) 5579–5588
DOI:10.3233/JIFS-179648
IOS Press

Document summarization using a structural metrics based representation

Augusto Villa-Monte a,1, Laura Lanzarini a, Julieta Corvi a and Aurelio F. Bariviera b

a Instituto de Investigación en Informática LIDI (Centro CICPBA), Facultad de Informática, Universidad Nacional de La Plata, 50 y 120 S/N La Plata, Buenos Aires, Argentina
b Universitat Rovira i Virgili, Department of Business, Av. Universitat 1 Reus, Spain

Abstract. Currently, each person produces 1.7 MB of information every second in different formats. However, the vast majority of this information is text. This has increased interest in techniques that automate the identification of the relevant portions of text documents in order to produce an automatic summary. This article presents a technique to extract the most representative sentences of a document, taking into account the user's criteria. These criteria are learned by a neural network from a minimum set of documents whose sentences have been rated by the user in terms of importance. To verify the performance of the proposed methodology, we used 220 scientific articles from the PLOS Medicine journal published between 2004 and 2016. The results obtained have been very satisfactory.

Keywords: Text summarization, extractive summaries, sentence scoring, feature selection, neural networks

1. Introduction

At present, data generation has become a natural output of human activity. Due to technological progress, many everyday actions are recorded. For this reason, efficient access to and use of such data have become fundamentally necessary in all areas, in order to construct pieces of information. Today, many areas are interested in extracting knowledge from stored information, even more so in the case of unstructured information.
Although data is continuously generated in many formats, most digital information is stored in text format. For example, each email sent, each search made on the Internet and each publication generates, to a greater or lesser extent, textual data. In general, text is stored in the form of unstructured digital documents, using a very different organization from that of traditional databases.

Scientific literature is characterized by producing an enormous amount of text documents that cover all the thematic areas of human knowledge. The use of automatic tools to summarize text is essential, since it could facilitate access to pertinent and available information. When the volume of information is immense, manually separating what is essential is a difficult and time-consuming task. The development of computational solutions to summarize text constitutes one of the current lines of research aimed at reducing the problems generated by the excessive growth of textual information.

Obtaining computer summaries reduces the enormous volume of unstructured information to its core content, in order to facilitate its manipulation. By reading an automatic summary, the user could get access to the main content of a document in less time than it would have taken to read the original document. Although the definition of summary may vary from one author to another, we could fairly agree to define

1 Post-Doctoral Fellow at National University of La Plata.
Corresponding author. Aurelio F. Bariviera, Universitat Rovira i Virgili, Department of Business, Av. Universitat 1 43204 Reus, Spain. E-mail: aurelio.fernandez@urv.cat.

ISSN 1064-1246/20/$35.00 © 2020 – IOS Press and the authors. All rights reserved