Atalaya at TASS 2018: Sentiment Analysis with
Tweet Embeddings and Data Augmentation
Atalaya en TASS 2018: Análisis de Sentimiento con
Embeddings de Tweets y Aumentación de Datos
Franco M. Luque (1), Juan Manuel Pérez (2)
(1) Universidad Nacional de Córdoba & CONICET
(2) Universidad de Buenos Aires & CONICET
francolq@famaf.unc.edu.ar, jmperez@dc.uba.ar
Resumen: El workshop TASS 2018 propone diferentes desafíos de análisis
semántico del Español. Este trabajo presenta nuestra participación con el equipo
Atalaya en la tarea de clasificación de polaridad de tweets. Seguimos técnicas
estándar de preprocesamiento, representación y clasificación, y también exploramos
algunas ideas novedosas. En particular, para obtener embeddings de tweets
entrenamos word embeddings con información de subpalabras, y usamos un esquema
de pesaje para promediarlos. Para lidiar con problemas de sobreajuste causados
por la escasez de datos de entrenamiento, probamos una estrategia de aumentación
de datos basada en traducción automática bidireccional. Experimentos con
clasificadores lineales y modelos neuronales muestran resultados competitivos para las
diferentes subtareas propuestas en el desafío.
Palabras clave: Análisis de Sentimiento, Clasificación de Polaridad, Embeddings,
Aumentación de Datos, Modelos Lineales, Redes Neuronales
Abstract: The TASS 2018 workshop proposes different challenges on semantic
analysis in Spanish. This work presents our participation as team Atalaya in the task of
polarity classification of tweets. We followed standard techniques in preprocessing,
representation and classification, and also explored some novel ideas. In particular,
to obtain tweet embeddings we trained subword-aware word embeddings and
used a weighting scheme to average them. To deal with overfitting problems caused
by training data scarcity, we tried a data augmentation strategy based on two-way
machine translation. Experiments with linear classifiers and neural models show
competitive results for the different subtasks proposed in the challenge.
Keywords: Sentiment Analysis, Polarity Classification, Embeddings, Data
Augmentation, Linear Models, Neural Networks
1 Introduction
Every year, the TASS workshop presents different challenges related to
sentiment analysis in Spanish. One of the main tasks is polarity
classification of tweets and tweet aspects. In particular, task 1 of
TASS 2018 (Martínez-Cámara et al., 2018) proposes polarity classification
on tweet datasets from three different Spanish-speaking countries:
Spain (ES), Costa Rica (CR) and Perú (PE). This article describes our
participation in TASS 2018 task 1 as team Atalaya. We present polarity
classification systems that use standard techniques, and we propose
improvements based on an iterative experimental development process.
We tried different approaches for tweet preprocessing, vector
representation and polarity classification models. We used standard
preprocessing techniques, including text simplification, stopword
filtering, lemmatization and negation handling. Tweets were represented
with bag-of-words, bag-of-characters, tweet embeddings and combinations
of these. As classification models, we considered linear classifiers
and neural networks.
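As an illustration of the preprocessing and representation steps listed above, the following is a minimal sketch in Python, not the system actually submitted. The stopword list, the negation words, the `NOT_` prefix, the negation scope of two tokens, and the function names are all assumptions made for this example; lemmatization is omitted because it requires an external resource.

```python
import re
from collections import Counter

# Hypothetical word lists, chosen only for illustration.
STOPWORDS = {"el", "la", "de", "que", "y", "en", "es"}
NEGATIONS = {"no", "nunca", "ni"}


def simplify(text):
    """Text simplification: lowercase, mask URLs and mentions, collapse repeats."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "URL", text)
    text = re.sub(r"@\w+", "USER", text)
    # Collapse runs of three or more identical characters ("holaaa" -> "hola").
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    return text


def mark_negation(tokens, scope=2):
    """Prefix the `scope` tokens that follow a negation word with NOT_."""
    out, remaining = [], 0
    for tok in tokens:
        if tok in NEGATIONS:
            out.append(tok)
            remaining = scope
        elif remaining > 0:
            out.append("NOT_" + tok)
            remaining -= 1
        else:
            out.append(tok)
    return out


def preprocess(text):
    """Simplify, tokenize, filter stopwords, then mark negated tokens."""
    tokens = re.findall(r"\w+", simplify(text))
    tokens = [t for t in tokens if t not in STOPWORDS]
    return mark_negation(tokens)


def bag_of_words(tokens):
    """Sparse bag-of-words representation as token counts."""
    return Counter(tokens)


def bag_of_chars(text, n=3):
    """Sparse bag-of-characters representation as character n-gram counts."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


if __name__ == "__main__":
    tweet = "No me gusta la película!!! http://t.co/x @user"
    tokens = preprocess(tweet)
    print(tokens)
    print(bag_of_words(tokens))
    print(bag_of_chars(simplify(tweet)).most_common(3))
```

The resulting count dictionaries could be fed to any linear classifier after vectorization; combining the word and character features amounts to merging the two sparse vectors.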
We used fastText subword-aware word
vectors using tweet datasets specifically pre-
TASS 2018: Workshop on Semantic Analysis at SEPLN, September 2018, pp. 29-35
ISSN 1613-0073 Copyright © 2018 by the paper's authors. Copying permitted for private and academic purposes.