Vector representation of Internet Domain Names using a Word Embedding technique

Waldemar López*, Jorge Merlino* and Pablo Rodríguez-Bocca*
*Instituto de Computación, Facultad de Ingeniería, Universidad de la República. Julio Herrera y Reissig 565, 11300, Montevideo, Uruguay.
Email: walopez,jmerlino,prbocca@fing.edu.uy

Abstract—Word embeddings are a well-known set of techniques widely used in natural language processing (NLP), and word2vec is a computationally efficient predictive model to learn such embeddings. This paper explores the use of word embeddings in a new scenario. We create a vector representation of Internet Domain Names (DNS) by taking the core ideas from NLP techniques and applying them to real anonymized DNS log queries from a large Internet Service Provider (ISP). Our main objective is to find semantically similar domains using only information from DNS queries, without any previous knowledge about the content of those domains. We use the word2vec unsupervised learning algorithm with a Skip-Gram model to create the embeddings, and we validate the quality of our results by expert visual inspection of similarities, and by comparing them with a third-party source, namely the similar sites service offered by Alexa Internet, Inc.

Index Terms—DNS, Word embeddings, word2vec, Tensorflow, Semantic Similarity, Natural Language Processing.

I. INTRODUCTION

The amount of time that people spend online has systematically increased in recent years [1]. Understanding the behavior of users in online content consumption is the focus of several research efforts; it has large implications for network design, online business, and the media industry [2]. Many studies apply machine learning to historical patterns of network resource consumption in order to extract knowledge about online customer behavior [3], [4]. Due to the inaccessibility of the information, few of these studies use the traces of DNS queries for this purpose.
The few exceptions are [5], [6], [7], [8], [9], none of which has as its main objective extracting knowledge about the semantic nature of the queried domains. There are several Web tools that try to estimate the semantic similarity between sites¹, for example to give web site owners the possibility of finding competitors for the same target audience, and to advise end-users on alternative providers for the same content. As a novel application, in this work we apply word embeddings to Internet Domain Name traces in order to find semantically similar domains without any knowledge about the domains other than their usage.

¹ http://www.alexa.com/find-similar-sites/, https://www.similarweb.com/, http://www.similarsitesearch.com/, Google Similar Pages, etc.

978-1-5386-3057-0/17/$31.00 © 2017 IEEE

Word embeddings are a set of techniques that map words or phrases of a vocabulary to vectors of real numbers. The idea is that semantically similar words are assigned nearby vectors, so that the model can transfer information learned about some words to other, similar words. This is equivalent to transforming a discrete space of atomic symbols, with one dimension per word, into a continuous vector space of lower dimension, which is a much more useful and tractable representation of text. Word embeddings are typically applied to texts in the context of natural language processing, in tasks such as syntactic parsing, language modeling, and predicting semantically related words [10], [11]. In natural language, the context of a word is determined by the words used right before and after it in a phrase; in our work, we consider the domain names queried by the same IP address before and after a given domain name as the context for that domain (i.e., the trace of DNS queries). For this work we obtained DNS recursive server logs, with anonymized IP addresses, from a large Internet Service Provider (ISP). These logs contain each query resolved by a farm of servers.
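The analogy between word contexts and DNS query contexts can be made concrete with a small sketch: group the queries by client IP, keep each client's queries in time order, and emit (target, context) pairs within a sliding window, exactly as the Skip-Gram model does for sentences. This is an illustrative reconstruction, not the paper's implementation; the tuple layout of the parsed log entries and the function name are assumptions.

```python
from collections import defaultdict

def skipgram_pairs(log_entries, window=2):
    """Build word2vec-style (target, context) pairs from a DNS query log.

    log_entries: iterable of (timestamp, ip, domain) tuples, assumed to
    be already sorted by timestamp. Each client's query sequence plays
    the role of a sentence; domains within `window` positions of the
    target domain are its context.
    """
    # One "sentence" per anonymized client IP, in query order.
    by_ip = defaultdict(list)
    for _, ip, domain in log_entries:
        by_ip[ip].append(domain)

    pairs = []
    for seq in by_ip.values():
        for i, target in enumerate(seq):
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # skip the target itself
                    pairs.append((target, seq[j]))
    return pairs
```

These pairs are what a Skip-Gram model is trained on: the network learns to predict the context domain given the target domain, which forces domains appearing in similar contexts toward nearby vectors.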
Each log line indicates the time, the anonymized IP address of the client, the queried domain, and the type of DNS query (A, AAAA, MX, etc.). To create the word embedding for the domains we use the word2vec [12] model, implemented with the Python Tensorflow library [13]. In order to evaluate the quality of our results, we explore two alternatives: an expert visual inspection of similarities, and the mean average precision (MAP) metric, which measures the mismatch between our similarities and those obtained from a third-party source, namely the similar sites service offered by Alexa Internet, Inc. Using this technique we show that the created embedding effectively places semantically similar domains near each other, and therefore it could be used to build a recommender system, or to predict the domains that will be queried in the near future in order to, for example, detect traffic anomalies or apply some cache mechanism. The rest of the paper is organized as follows: Section II introduces current techniques for finding semantically similar domains; Section III explains the Domain Name System on the Internet, its particularities for our problem, and the data used in this work; Section IV introduces the basis of the word2vec algorithm, its applications to other contexts, and how we use it to solve our problem; Section V presents experiments and results of applying word embedding to find semantically
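The section above does not spell out the MAP computation, so the following is a minimal sketch of the standard definition: for each query domain, the average precision (AP) rewards relevant domains (here, those also listed by the third-party service) appearing early in our ranked similarity list, and MAP is the mean of AP over all query domains. The example domain names are purely illustrative.

```python
def average_precision(ranked, relevant):
    """AP of one ranked result list against a ground-truth set:
    mean of precision@k over the ranks k where a relevant item appears."""
    hits, score = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            score += hits / k
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(results, ground_truth):
    """results: {query_domain: ranked list of similar domains};
    ground_truth: {query_domain: set of relevant domains}."""
    aps = [average_precision(ranked, ground_truth.get(q, set()))
           for q, ranked in results.items()]
    return sum(aps) / len(aps) if aps else 0.0
```

For instance, if for one query domain our embedding ranks a relevant domain first and the other relevant domain third, AP = (1/1 + 2/3) / 2 ≈ 0.83; averaging such scores over all evaluated domains gives the MAP reported against the Alexa similar-sites lists.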