Sentiment Analysis on Brazilian Portuguese User Reviews

Frederico Dias Souza
Electrical Engineering Department
Federal University of Rio de Janeiro
Rio de Janeiro, Brazil
fredericods@poli.ufrj.br

João Baptista de Oliveira e Souza Filho
Electrical Engineering Department
Federal University of Rio de Janeiro
Rio de Janeiro, Brazil
jbfilho@poli.ufrj.br

Abstract—Sentiment Analysis is one of the most classical and widely studied natural language processing tasks. This problem has seen notable advances with the proposition of more complex and scalable machine learning models. Despite this progress, the Brazilian Portuguese language still has only limited linguistic resources, such as datasets dedicated to sentiment classification, especially when considering predefined partitions into training, testing, and validation sets that would allow a fairer comparison of different algorithmic alternatives. Motivated by these issues, this work analyzes the predictive performance of a range of document embedding strategies, assuming the polarity as the system outcome. This analysis covers five sentiment analysis datasets in Brazilian Portuguese, unified into a single dataset, together with a reference partitioning into training, testing, and validation sets, both made publicly available through a digital repository. A cross-evaluation of dataset-specific models over different contexts is conducted to evaluate their generalization capabilities and the feasibility of adopting a unique model for addressing all scenarios.

Index Terms—Text Classification, Sentiment Analysis, Natural Language Processing, Machine Learning, Benchmarks

I. INTRODUCTION

Text classification (TC) is a classical natural language processing (NLP) application. The most basic approach to TC consists of extracting specific features from the documents and subsequently feeding them to some classifier responsible for predicting document labels.
One of the most popular methods for addressing this feature extraction task is the bag-of-words (BoW). The BoW produces a reduced and simplified representation of an entire document, ignoring aspects like grammar, word appearance order, and semantic relations between words and phrases. Common weighting schemes are the raw word frequency and the term frequency-inverse document frequency (TF-IDF) [1]. Commonly, the BoW is followed by a classical machine learning (ML) classifier, such as Logistic Regression, Support Vector Machines, Gradient Boosting Decision Trees, or Random Forests, in sentiment classification tasks. Since most of these models are fast and straightforward to implement and train, they represent a handy baseline. Despite their simplicity, such methods can achieve high performance for simple texts, comparable to or even better than more complex alternatives. A drawback of BoW models is that they do not easily generalize to new tasks or properly handle the large amounts of training data available nowadays [2].

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) Finance Code 001.

Since the shift in the ML paradigm motivated by AlexNet [3], state-of-the-art models in NLP and Computer Vision mostly include deep learning (DL) architectures [2]. Despite often being more challenging and slower to train, such architectures can easily learn complex patterns and scale to larger datasets. Compared to classical models, their DL counterparts do not require hand-crafted feature extraction, since features are automatically learned during model training [4].

Roughly, a neural-based text classification model can be as simple as a feedforward neural network fed with a high-dimensional vector, defined by some aggregation of the multiple vectors (embeddings) representing the words composing a document. Popular word embeddings include Word2Vec [5], GloVe [6], and FastText [7].
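As a concrete illustration of the BoW/TF-IDF weighting discussed above, the sketch below computes TF-IDF vectors in pure Python. The toy documents and the unsmoothed idf = log(N/df) variant are illustrative assumptions; production libraries typically apply smoothing and normalization on top of this basic scheme.

```python
import math
from collections import Counter

def tfidf(docs):
    """Toy TF-IDF: tf = raw term count in the document,
    idf = log(N / df), where df is the number of documents
    containing the term. Returns the vocabulary and one
    weight vector per document."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

# hypothetical two-review corpus (tokenized Portuguese snippets)
docs = [
    "otimo produto recomendo".split(),
    "produto ruim nao recomendo".split(),
]
vocab, vecs = tfidf(docs)
# terms present in every document (e.g. "produto") get idf = log(1) = 0,
# so only discriminative terms keep a positive weight
```

Note how the unsmoothed idf zeroes out terms shared by all documents, which is precisely the "ignore uninformative words" effect the weighting is meant to achieve.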
A remarkable example is due to Iyyer et al., who proposed the Deep Averaging Network (DAN) [8], according to which the embeddings associated with an input sequence of tokens are averaged and the resulting vector is fed through several feedforward layers to produce a vector representing the whole sentence, finally submitted to a simple linear classifier.

Typically, Recurrent Neural Networks (RNNs) can exploit more complex data patterns than bag-of-words approaches, accessing their mutual dependencies more effectively and thus better capturing the sentence context [2]. The most popular variant of the RNN is the Long Short-Term Memory (LSTM), first proposed by Hochreiter and Schmidhuber [9], aiming to mitigate the vanishing and exploding gradient problems associated with RNNs [2]. Later, many variants of this model were proposed in the context of text classification, for instance, the remarkable work of Zhou et al. [10], which proposed a bidirectional LSTM followed by a 2D max-pooling operation.

More recently, the Transformer, proposed by Vaswani et al. [11], has revolutionized the NLP area. This work brought a smart alternative to the sequential and slow training faced with RNNs by replacing the recurrence mechanism with multiple attention layers [2], whose major distinguishing characteristic is that training can be performed in parallel. As a result, the complexity of the NLP

arXiv:2112.05459v1 [cs.CL] 10 Dec 2021
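The embedding-averaging composition at the core of the DAN described above can be sketched as follows. The tiny embedding table and its dimensionality are hypothetical, and the feedforward layers and linear classifier of the original model are omitted; only the averaging step is shown.

```python
def average_embeddings(tokens, embeddings, dim):
    """DAN-style composition: the sentence vector is the mean of the
    embeddings of its known words; out-of-vocabulary tokens are
    skipped, and an all-OOV sentence maps to the zero vector."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# hypothetical 3-dimensional embedding table
emb = {
    "bom":   [1.0, 0.0, 2.0],
    "filme": [0.0, 2.0, 0.0],
}
sent = average_embeddings(["bom", "filme", "oov"], emb, 3)
# mean of the two known vectors: [0.5, 1.0, 1.0]
```

In the full DAN, this averaged vector would then pass through several feedforward layers before the final linear classifier; the averaging itself is what makes the model order-insensitive, like a BoW, yet still benefit from pretrained embeddings.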