Atalaya at TASS 2018: Sentiment Analysis with Tweet Embeddings and Data Augmentation
Atalaya en TASS 2018: Análisis de Sentimiento con Embeddings de Tweets y Aumentación de Datos

Franco M. Luque¹, Juan Manuel Pérez²
¹ Universidad Nacional de Córdoba & CONICET
² Universidad de Buenos Aires & CONICET
francolq@famaf.unc.edu.ar, jmperez@dc.uba.ar

Abstract: The TASS 2018 workshop proposes different challenges on semantic analysis in Spanish. This work presents our participation as team Atalaya in the task of polarity classification of tweets. We followed standard techniques in preprocessing, representation and classification, and also explored some novel ideas. In particular, to obtain tweet embeddings we trained subword-aware word embeddings and used a weighting scheme to average them. To deal with overfitting problems caused by training data scarcity, we tried a data augmentation strategy based on two-way machine translation.
Experiments with linear classifiers and neural models show competitive results for the different subtasks proposed in the challenge.

Keywords: Sentiment Analysis, Polarity Classification, Embeddings, Data Augmentation, Linear Models, Neural Networks

TASS 2018: Workshop on Semantic Analysis at SEPLN, September 2018, pp. 29-35. ISSN 1613-0073. Copyright © 2018 by the paper's authors. Copying permitted for private and academic purposes.

1 Introduction

Every year, the TASS workshop presents different challenges related to sentiment analysis in Spanish. One of the main tasks is polarity classification of tweets and tweet aspects. In particular, task 1 of TASS 2018 (Martínez-Cámara et al., 2018) proposes polarity classification on tweet datasets from three different Spanish-speaking countries: Spain (ES), Costa Rica (CR) and Peru (PE). This article describes our participation in TASS 2018 task 1 with team Atalaya. We present polarity classification systems using standard techniques and propose improvements based on an iterative experimental development process.

We tried different approaches for tweet preprocessing, vector representation and polarity classification models. Standard preprocessing techniques, including text simplification, stopword filtering, lemmatization and negation handling, were used. Tweets were represented with bag-of-words, bag-of-characters, tweet embeddings and combinations of these. As classification models, we considered linear classifiers and neural networks. We used fastText subword-aware word vectors using tweet datasets specifically pre-