Exploring Swedish & English fastText Embeddings with the Transformer

Tosin P. Adewumi, Foteini Liwicki & Marcus Liwicki
firstname.lastname@ltu.se
EISLAB, SRT Department, Luleå University of Technology, Sweden

Abstract

In this paper, our main contributions are to show that embeddings from relatively smaller corpora can outperform ones from far larger corpora, and to present a new Swedish analogy test set. Several factors play important roles in achieving good network performance on natural language processing (NLP) downstream tasks: dataset size, the right hyper-parameters, and well-trained embeddings. We show that, with the right set of hyper-parameters, good network performance can be reached even on smaller datasets. We evaluate the embeddings at both the intrinsic and extrinsic levels, deploying them on the Transformer in a named entity recognition (NER) task, and conduct significance tests. This is done for both Swedish and English. With far smaller training data, we obtain better performance in both languages on the downstream task than with the recently released common-crawl versions, and character n-grams appear useful for Swedish, a morphologically rich language.

Keywords: Embeddings, Transformer, Analogy, Dataset, NER, Swedish

1. Introduction

The embedding layer of neural networks may be initialized randomly or replaced with pre-trained vectors, which act as lookup tables. One such pre-trained vector tool is fastText, introduced by Joulin et al. (2016). The main advantages of fastText are its speed and its competitive performance relative to the state of the art (SotA). Using pre-trained embeddings in deep networks like the Transformer can improve performance. Vaswani et al. (2017) introduced the Transformer, a SotA architecture based solely on self-attention mechanisms, which demonstrated better performance while requiring less time to train.
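As a minimal sketch of the lookup-table view of an embedding layer (using a toy hand-made vocabulary and NumPy, not the paper's actual models or data), initializing from pre-trained vectors amounts to building a matrix whose rows are indexed by token IDs:

```python
import numpy as np

# Hypothetical pre-trained vectors (in practice these would be loaded
# from a fastText .vec file, one vector per vocabulary word).
pretrained = {
    "the": np.array([0.1, 0.3, -0.2]),
    "cat": np.array([0.7, -0.1, 0.4]),
    "sat": np.array([-0.3, 0.5, 0.2]),
}

# The lookup table: word -> index, plus an embedding matrix whose
# row i holds the vector of the word with index i.
word2idx = {w: i for i, w in enumerate(pretrained)}
embedding_matrix = np.stack([pretrained[w] for w in word2idx])

def embed(tokens):
    """Look up each token's row in the embedding matrix."""
    idxs = [word2idx[t] for t in tokens]
    return embedding_matrix[idxs]

vectors = embed(["the", "cat", "sat"])
print(vectors.shape)  # (3, 3): one 3-dimensional vector per token
```

A randomly initialized layer would simply replace `embedding_matrix` with random values and learn it during training; the lookup mechanism is identical.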
Usually, downstream tasks are applied after pre-training language models on such deep networks (Brown et al., 2020; Devlin et al., 2018). Despite the plethora of embeddings in many languages, there is a dearth of analogy test sets with which to evaluate them, including for Swedish (Al-Rfou et al., 2013; Fallgren et al., 2016; Précenth, 2019; Venekoski and Vankka, 2017). This is because creating labelled or structured datasets can be expensive in terms of the time and attention required. Grave et al. (2018) created embeddings for 157 different languages but provided analogy test sets for only three of them: French, Hindi and Polish. An analogy test set, introduced by Mikolov et al. (2013), provides some indication of the quality and likely performance of word

arXiv:2007.16007v1 [cs.CL] 23 Jul 2020
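An analogy item of the kind Mikolov et al. (2013) introduced ("a is to b as c is to ?") is typically scored by vector arithmetic and a nearest-neighbour search. The sketch below uses toy hand-made vectors (not real fastText embeddings, which would be loaded from a trained model) to show the b - a + c computation:

```python
import numpy as np

# Toy vectors constructed so that king - man + woman lands near queen;
# a real evaluation would load trained fastText embeddings instead.
vecs = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' via the nearest cosine
    neighbour of b - a + c, excluding the three cue words."""
    target = vecs[b] - vecs[a] + vecs[c]
    candidates = [w for w in vecs if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vecs[w], target))

print(analogy("man", "king", "woman"))  # prints "queen"
```

An analogy test set is simply a large list of such quadruples, grouped into semantic and syntactic categories, with accuracy reported per category.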