NETHIC: A System for Automatic Text Classification using Neural Networks and Hierarchical Taxonomies

Andrea Ciapetti 1, Rosario Di Florio 1, Luigi Lomasto 1, Giuseppe Miscione 1, Giulia Ruggiero 1 and Daniele Toti 1,2 a

1 Innovation Engineering S.r.l., Rome, Italy
2 Department of Sciences, Roma Tre University, Rome, Italy
a https://orcid.org/0000-0002-9668-6961

Keywords: Machine Learning, Neural Networks, Taxonomies, Text Classification.

Abstract: This paper presents NETHIC, a software system for the automatic classification of textual documents based on hierarchical taxonomies and artificial neural networks. This approach combines the advantages of highly-structured hierarchies of textual labels with the versatility and scalability of neural networks, yielding a textual classifier that displays high levels of performance in terms of both effectiveness and efficiency. The system has first been tested as a general-purpose classifier on a generic document corpus, and then applied to the specific domain tackled by DANTE, a European project meant to address criminal and terrorist-related online content, showing consistent results across both application domains.

1 INTRODUCTION

With the increasing use of social networks and the digitalization of governmental structures, computer scientists are facing new challenges and needs. Official communications and documentation are usually stored as electronic textual documents, while the personal communications exchanged between individuals take the form of chat messages, tweets, e-mails, blog entries, etc. This leads to an increased volume of textual information that may cause growing confusion and hinder the effectiveness of the communication itself. Understanding the subject category each data item falls into, and the topics it discusses, has become paramount for the effective management and analysis of this deluge of information. Indeed, in recent years, significant effort and considerable resources have been spent to satisfy this need within the context of governmental and commercial projects (Dalal and Zaveri, 2011). One potential technique in this regard is automatic text classification, which falls into the category of supervised machine learning tasks. Such a process is meant to automatically assign a set of pre-defined classes to a document by using a machine learning technique (Sebastiani, 2002).

This paper describes NETHIC, an automatic text classification system based on a hierarchical taxonomy and artificial neural networks (ANNs). Taxonomies represent knowledge in a structured and human-readable manner. Their hierarchical structure enables efficient and automated content classification (Wetzker et al., 2008). Artificial neural networks, on the other hand, have some interesting properties that make this family of machine learning algorithms very appealing when facing difficult pattern-discovery tasks. This combined approach is especially useful when a large amount of data is used during the training phase, and can be easily implemented in parallel architectures (i.e., with multi-core processors or systems with dedicated GPUs). This may drastically reduce the processing time compared to other kinds of algorithms, while achieving similar results in terms of effectiveness (Hermundstad et al., 2011).

In this work, the NETHIC system is detailed, showing how it displays a significant level of performance by using different taxonomies. First, a generic taxonomy is used in order to obtain a general-purpose text classifier.
Then, a specific taxonomy covering domain-specific texts and concepts from the DANTE Horizon 2020 project is introduced, so that the system can also be used to classify documents dealing with terrorist and criminal activities, which are the very subject of the DANTE project.

This paper is structured as follows. In Section 2,

Ciapetti, A., Di Florio, R., Lomasto, L., Miscione, G., Ruggiero, G. and Toti, D.
NETHIC: A System for Automatic Text Classification using Neural Networks and Hierarchical Taxonomies.
DOI: 10.5220/0007709702960306
In Proceedings of the 21st International Conference on Enterprise Information Systems (ICEIS 2019), pages 296-306
ISBN: 978-989-758-372-8
Copyright © 2019 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved