Information

Review

A Survey on Text Classification Algorithms: From Text to Predictions

Andrea Gasparetto 1,*, Matteo Marcuzzo 1, Alessandro Zangari 1 and Andrea Albarelli 2

1 Department of Management, Ca’ Foscari University, 30123 Venice, Italy; matteo.marcuzzo@unive.it (M.M.); alessandro.zangari@unive.it (A.Z.)
2 Department of Environmental Sciences, Informatics and Statistics, Ca’ Foscari University, 30123 Venice, Italy; albarelli@unive.it
* Correspondence: andrea.gasparetto@unive.it

Citation: Gasparetto, A.; Marcuzzo, M.; Zangari, A.; Albarelli, A. A Survey on Text Classification Algorithms: From Text to Predictions. Information 2022, 13, 83. https://doi.org/10.3390/info13020083

Academic Editor: Gennady Agre

Received: 10 January 2022
Accepted: 9 February 2022
Published: 11 February 2022

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Abstract: In recent years, the exponential growth of digital documents has been met by rapid progress in text classification techniques. Newly proposed machine learning algorithms leverage the latest advancements in deep learning methods, allowing for the automatic extraction of expressive features. The swift development of these methods has led to a plethora of strategies to encode natural language into machine-interpretable data. The latest language modelling algorithms are used in conjunction with ad hoc preprocessing procedures, of which the description is often omitted in favour of a more detailed explanation of the classification step. This paper offers a concise review of recent text classification models, with emphasis on the flow of data, from raw text to output labels.
We highlight the differences between earlier methods and more recent, deep learning-based methods in both their functioning and in how they transform input data. To give a better perspective on the text classification landscape, we provide an overview of datasets for the English language, as well as supplying instructions for the synthesis of two new multilabel datasets, which we found to be particularly scarce in this setting. Finally, we provide an outline of new experimental results and discuss the open research challenges posed by deep learning-based language models.

Keywords: text classification; tokenisation; topic labelling; news classification; transformer; shallow learning; deep learning; multilabel corpora

1. Introduction

Text classification (TC) is a task of fundamental importance, and it has been gaining traction thanks to recent developments in the fields of text mining and natural language processing (NLP). Text classification methods share the common goal of designating a predefined label for a given input text, though this denomination can refer to a variety of specialised methods applied to different domains.

Classic examples of TC include information retrieval, topic labelling, sentiment analysis, and news classification. However, TC has practical applications that extend beyond simple categorisation, such as extractive question answering and summarisation systems. In this case, the intuitive notion of “label” is substituted with a choice between candidates (e.g., an answer or a sentence to include in a summary).

The speed at which textual information is currently being created has long outclassed manual solutions to these tasks, meaning that TC methods are not only useful, but also strictly necessary. Accordingly, developing accurate and unbiased TC systems is of paramount importance.

1.1. Text Classification Tasks

A variety of standard definitions for TC tasks exist in the NLP research area, often used as benchmarks to evaluate new methods.
We outline the main representatives, approximately following the taxonomy proposed by Li et al. [1]: