Citation: Gasparetto, A.; Marcuzzo, M.; Zangari, A.; Albarelli, A. A Survey on Text Classification Algorithms: From Text to Predictions. Information 2022, 13, 83. https://doi.org/10.3390/info13020083
Academic Editor: Gennady Agre
Received: 10 January 2022
Accepted: 9 February 2022
Published: 11 February 2022
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
information
Review
A Survey on Text Classification Algorithms: From Text to Predictions
Andrea Gasparetto 1,*, Matteo Marcuzzo 1, Alessandro Zangari 1 and Andrea Albarelli 2
1 Department of Management, Ca’ Foscari University, 30123 Venice, Italy; matteo.marcuzzo@unive.it (M.M.); alessandro.zangari@unive.it (A.Z.)
2 Department of Environmental Sciences, Informatics and Statistics, Ca’ Foscari University, 30123 Venice, Italy; albarelli@unive.it
* Correspondence: andrea.gasparetto@unive.it
Abstract: In recent years, the exponential growth of digital documents has been met by rapid progress
in text classification techniques. Newly proposed machine learning algorithms leverage the latest
advancements in deep learning methods, allowing for the automatic extraction of expressive features.
The swift development of these methods has led to a plethora of strategies to encode natural language
into machine-interpretable data. The latest language modelling algorithms are used in conjunction
with ad hoc preprocessing procedures, whose description is often omitted in favour of a
more detailed explanation of the classification step. This paper offers a concise review of recent
text classification models, with emphasis on the flow of data, from raw text to output labels. We
highlight the differences between earlier methods and more recent, deep learning-based methods,
both in their functioning and in how they transform input data. To give a better perspective on the
text classification landscape, we provide an overview of datasets for the English language and
supply instructions for the synthesis of two new multilabel datasets, a type of corpus that we found to be
particularly scarce in this setting. Finally, we provide an outline of new experimental results and
discuss the open research challenges posed by deep learning-based language models.
Keywords: text classification; tokenisation; topic labelling; news classification; transformer; shallow
learning; deep learning; multilabel corpora
1. Introduction
Text classification (TC) is a task of fundamental importance, and it has been gaining
traction thanks to recent developments in the fields of text mining and natural language
processing (NLP). Text classification methods share the common goal of assigning a
predefined label to a given input text, though the term covers a variety of
specialised methods applied to different domains.
Classic examples of TC include information retrieval, topic labelling, sentiment analysis,
and news classification. However, TC has practical applications that extend beyond
simple categorisation, such as extractive question answering and summarisation systems.
In these cases, the intuitive notion of “label” is replaced by a choice between candidates
(e.g., an answer or a sentence to include in a summary).
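To make the two views of TC above concrete, the following minimal sketch contrasts plain label assignment with selection among candidates. It is an illustration only and not a method from this survey: the label set, the keyword lists, and the function names `classify` and `select_candidates` are all hypothetical, and simple keyword overlap stands in for a trained model.

```python
from typing import Callable, List, Sequence

# Hypothetical label set and keyword lists, chosen only for illustration.
KEYWORDS = {
    "sports": {"match", "team", "goal"},
    "finance": {"market", "stocks", "bank"},
}


def classify(text: str) -> str:
    """Assign one predefined label: the one whose keywords overlap the text most."""
    tokens = set(text.lower().split())
    return max(KEYWORDS, key=lambda label: len(KEYWORDS[label] & tokens))


def select_candidates(
    candidates: Sequence[str],
    score: Callable[[str], float],
    k: int = 1,
) -> List[str]:
    """Recast selection as classification: keep the k highest-scoring candidates.

    In extractive summarisation the candidates would be sentences; in
    extractive question answering they would be candidate answers.
    """
    ranked = sorted(candidates, key=score, reverse=True)
    return list(ranked[:k])
```

For example, `classify("The team scored a late goal")` selects the hypothetical `sports` label, while `select_candidates(sentences, score=len, k=2)` would keep the two longest sentences as a crude stand-in for a learned relevance score.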
The speed at which textual information is currently being created has long outpaced
manual solutions to these tasks, meaning that TC methods are not only useful, but
also strictly necessary. Accordingly, developing accurate and unbiased TC systems is of
paramount importance.
1.1. Text Classification Tasks
A variety of standard TC tasks are defined in the NLP literature and are often
used as benchmarks to evaluate new methods. We outline the main representatives,
approximately following the taxonomy proposed by Li et al. [1]: