Atalaya at TASS 2018: Sentiment Analysis with Tweet Embeddings and Data Augmentation

Atalaya en TASS 2018: Análisis de Sentimiento con Embeddings de Tweets y Aumentación de Datos

Franco M. Luque¹, Juan Manuel Pérez²
¹ Universidad Nacional de Córdoba & CONICET
² Universidad de Buenos Aires & CONICET
francolq@famaf.unc.edu.ar, jmperez@dc.uba.ar

Abstract: The TASS 2018 workshop proposes different challenges on semantic analysis in Spanish. This work presents our participation as team Atalaya in the task of polarity classification of tweets. We followed standard techniques in preprocessing, representation and classification, and also explored some novel ideas. In particular, to obtain tweet embeddings we trained subword-aware word embeddings and used a weighting scheme to average them. To deal with overfitting problems caused by training data scarcity, we tried a data augmentation strategy based on two-way machine translation.
Experiments with linear classifiers and neural models show competitive results for the different subtasks proposed in the challenge.

Keywords: Sentiment Analysis, Polarity Classification, Embeddings, Data Augmentation, Linear Models, Neural Networks

1 Introduction

Every year, the TASS workshop presents different challenges related to sentiment analysis in Spanish. One of the main tasks is polarity classification of tweets and tweet aspects. In particular, task 1 of TASS 2018 (Martínez-Cámara et al., 2018) proposes polarity classification on tweet datasets from three different Spanish-speaking countries: Spain (ES), Costa Rica (CR) and Perú (PE). This article describes our participation in TASS 2018 task 1 as team Atalaya. We present polarity classification systems that use standard techniques and propose improvements based on an iterative experimental development process.

We tried different approaches for tweet preprocessing, vector representation and polarity classification models. Standard preprocessing techniques, including text simplification, stopword filtering, lemmatization and negation handling, were used. Tweets were represented with bag-of-words, bag-of-characters, tweet embeddings and combinations of these. As classification models, we considered linear classifiers and neural networks. We used fastText subword-aware word vectors using tweet datasets specifically pre-

TASS 2018: Workshop on Semantic Analysis at SEPLN, September 2018, pp. 29-35. ISSN 1613-0073. Copyright © 2018 by the paper's authors. Copying permitted for private and academic purposes.