IEEE Second International Conference on Data Stream Mining & Processing
August 21-25, 2018, Lviv, Ukraine
978-1-5386-2874-4/18/$31.00 ©2018 IEEE

Building the Semantic Similarity Model for Social Network Data Streams

Svitlana Petrasova
National Technical University “Kharkiv Polytechnic Institute”
Kharkiv, Ukraine
svetapetrasova@gmail.com

Nina Khairova
National Technical University “Kharkiv Polytechnic Institute”
Kharkiv, Ukraine
khairova@kpi.kharkov.ua

Włodzimierz Lewoniewski
Poznan University of Economics and Business
Poznan, Poland
wlodzimierz.lewoniewski@ue.poznan.pl

Abstract— This paper proposes a model for searching for similar collocations in English texts in order to determine semantically connected text fragments for social network data stream analysis. The logical-linguistic model uses semantic and grammatical features of words to obtain a sequence of semantically related text fragments from different actors of a social network. To implement the model, we leverage the Universal Dependencies parser and the Natural Language Toolkit with the lexical database WordNet. In an experiment based on the Blog Authorship Corpus, the model achieves a precision of over 0.92.

Keywords— social network; data stream; collocations; semantic similarity; blogs; corpus; Universal Dependencies; WordNet

I. INTRODUCTION

In recent years, social media has become a source of communication and data distribution, and a factor in the formation of an informal information space. Many business companies and intelligence agencies have turned to computer processing to monitor these social streams [1]. The main objects of the modern information society are social networks, forums, blogs, etc. When processing such data streams, the following factors should be considered: unstable content quality (e.g. spam and fake accounts) and problems with the privacy of users' personal data. All of this requires constant improvement of the algorithms for analyzing and processing social data streams.
One of the approaches to studying online social structures is Social Network Analysis. Its main objectives are to investigate interactions between social actors and to identify the conditions under which these interactions emerge [2, 3]. Thus, a network of social interactions consists of a finite set of social actors and a set of links between them [4]. Nowadays, the main methods for analyzing social networks are: (1) methods of graph theory for studying the structural relationships of an actor; (2) methods for determining the equivalence of actors; (3) probabilistic models; (4) topological methods that represent the network as a formalized complex of elements and links, etc.

However, we believe that natural language processing (NLP) approaches are important for processing social data streams represented by actors’ text information. To date, analyzing social network texts is one of the most challenging tasks in NLP. Despite existing NLP applications for information extraction (IE) [5], it is difficult to extract relevant information from streams of informal natural language sources.

In the scope of semantic processing of text streams such as those posted by people in public forums (Facebook, Twitter, LinkedIn, Google+), blogs, etc., we aim to obtain a sequence of semantically related text units from different actors of a social network. To solve this problem, we suggest extracting semantically similar units at various levels of the language, i.e. analyzing not only syntactic relations between words or sentences but also semantic correlations between words, phrases and collocations.

Although there are currently many studies on computing word similarity, relatively little research has been carried out on extracting semantically similar phrases or collocations from natural language texts. A collocation is a combination of two or more words that are often used together and are both syntactically and semantically integrated.
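The "often used together" part of this definition can be illustrated with a simple co-occurrence statistic. The sketch below scores adjacent word pairs by pointwise mutual information (PMI). This is a generic statistical measure chosen here for illustration only, not the logical-linguistic model proposed in this paper. The function name `bigram_pmi`, the `min_count` threshold and the toy text are all hypothetical.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Return {(w1, w2): PMI} for adjacent word pairs seen at least min_count times.

    Illustrative only: PMI captures how much more often two words occur
    side by side than expected by chance. It says nothing about the
    syntactic or semantic integration that a true collocation also requires.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # skip rare pairs, whose PMI estimates are unreliable
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy text (hypothetical), tokenized by whitespace for simplicity.
text = (
    "social network data streams grow fast . "
    "we analyse social network posts . "
    "data streams from a social network are noisy ."
)
tokens = text.split()
scores = bigram_pmi(tokens)

# Print candidate pairs, strongest association first.
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

On real data, such frequency-based candidates would still need the syntactic and semantic checks that the definition above requires.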
In contrast to individual words, which may be polysemous and have synonyms, collocations carry more specific semantic information. Therefore, the semantic similarity of collocations may better identify semantically similar text fragments of different social actors.

This paper addresses the problem of searching for similar collocations in English texts in order to determine semantically connected text fragments for social network data stream analysis.

II. RELATED WORK

Nowadays, there are a few approaches to extracting semantically similar collocations from texts. At the stage of determining the semantic similarity of collocations, they mainly use statistical laws, (recurrent) neural networks (e.g. LSTM networks that encode patterns of collocations as vector representations) [6], or syntactic characteristics of collocations. For instance, in [7] English synonymous collocation pairs are extracted using translation information. This method obtains candidate synonymous collocation pairs from a monolingual corpus and a thesaurus, and then selects the appropriate pairs from the candidates using their translations in a second language. Another method [8] collects sets of words and paraphrases via pairwise alignment of sentence fragments. Reference [9] presents a corpus-based method for automatic extraction of paraphrases using multiple English translations of the same source text. Generally, all of these studies work on texts of certain domains and take semantic information from thesauri that