IEEE Second International Conference on Data Stream Mining & Processing
August 21-25, 2018, Lviv, Ukraine
978-1-5386-2874-4/18/$31.00 ©2018 IEEE
Building the Semantic Similarity Model for Social
Network Data Streams
Svitlana Petrasova
National Technical University “Kharkiv
Polytechnic Institute”
Kharkiv, Ukraine
svetapetrasova@gmail.com
Nina Khairova
National Technical University “Kharkiv
Polytechnic Institute”
Kharkiv, Ukraine
khairova@kpi.kharkov.ua
Włodzimierz Lewoniewski
Poznan University of Economics and
Business
Poznan, Poland
wlodzimierz.lewoniewski@ue.poznan.pl
Abstract— This paper proposes a model for searching for
similar collocations in English texts in order to determine
semantically connected text fragments for the analysis of social
network data streams. The logical-linguistic model uses semantic
and grammatical features of words to obtain a sequence of
semantically related text fragments from
different actors of a social network. To implement the
model, we use the Universal Dependencies parser and the Natural
Language Toolkit with the lexical database WordNet. In an
experiment on the Blog Authorship Corpus, the model achieves a
precision of over 0.92.
Keywords— social network; data stream; collocations;
semantic similarity; blogs; corpus; Universal Dependencies;
WordNet
I. INTRODUCTION
In recent years, social media have become a means of
communication and data distribution, and a part of the
informal information space. Many business companies
and intelligence agencies have turned to computer processing
to monitor these social streams [1].
The main objects of the modern information society are social
networks, forums, blogs, etc. When processing such data
streams, the following factors should be considered: unstable
content quality, e.g. spam and fake accounts, and
problems with the privacy of users' personal data. All of this
requires constant improvement of the algorithms for analyzing and
processing social data streams.
One of the approaches for studying online social
structures is Social Network Analysis. Its main objectives are
investigation of interactions between social actors and
identification of the conditions for the emergence of these
interactions [2, 3]. Thus, the network of social
interactions consists of a finite set of social actors and a set
of links between them [4].
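As a minimal illustration of this definition, and not as part of the proposed model, a social network can be stored as a finite set of actors and a set of links. The actor names and the helper `neighbors` are hypothetical, and the links are assumed to be undirected.

```python
from typing import FrozenSet, Set

# Hypothetical actors and undirected links, for illustration only.
actors: Set[str] = {"alice", "bob", "carol"}
links: Set[FrozenSet[str]] = {
    frozenset({"alice", "bob"}),
    frozenset({"bob", "carol"}),
}


def neighbors(actor: str) -> Set[str]:
    """Return the actors directly linked to the given actor."""
    return {
        other
        for link in links
        if actor in link
        for other in link
        if other != actor
    }
```

Storing each link as a `frozenset` keeps it unordered, which matches the undirected assumption; a directed network would use ordered pairs instead.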
Nowadays, the main methods for analyzing social
networks are: (1) methods of graph theory for studying the
structural relationships of an actor; (2) methods for
determining the equivalence of actors; (3) probabilistic
models; (4) topological methods that represent the network
in the form of some formalized complex of elements and
links, etc.
However, we suppose that natural language processing (NLP)
approaches are important for processing social data streams
represented by actors’ text information. To date, analyzing
texts of social networks remains one of the most challenging
tasks in NLP. Despite existing NLP applications for
information extraction (IE) [5], it is difficult to
extract relevant information from streams of informal
natural language sources.
In the scope of semantic processing of text streams such as
those posted by people in public forums (Facebook, Twitter,
LinkedIn, Google+), blogs, etc., we aim to obtain a sequence
of semantically related text units from different
actors of a social network. To solve this problem, we
suggest extracting semantically similar units at various levels
of the language, i.e. analyzing not only syntactic relations
between words or sentences but also semantic correlations
between words, phrases and collocations. Although there are
currently many studies on computing word similarity,
relatively little research has been
carried out into extracting semantically similar phrases or
collocations from natural language texts.
A collocation is a combination of two or more words that are
often used together and are both syntactically and semantically
integrated. Unlike individual words, which may be polysemous
and have synonyms, collocations carry more specific
semantic information. Therefore, the semantic similarity of
collocations may better identify semantically similar text
fragments of different social actors.
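The idea of comparing collocations rather than single words can be sketched as follows. This is a simplified illustration under stated assumptions, not the logical-linguistic model of this paper: a collocation is reduced to a pair of words, the `SYNONYMS` table is a hypothetical stand-in for a lexical database such as WordNet, and two collocations count as similar when their words match position by position.

```python
from typing import Dict, Set, Tuple

# Toy synonym sets standing in for a lexical database such as WordNet.
# The entries are hypothetical and chosen only for illustration.
SYNONYMS: Dict[str, Set[str]] = {
    "buy": {"buy", "purchase"},
    "purchase": {"buy", "purchase"},
    "car": {"car", "automobile"},
    "automobile": {"car", "automobile"},
}

Collocation = Tuple[str, str]  # e.g. (verb, object noun)


def words_match(a: str, b: str) -> bool:
    """True if two words are identical or share a toy synonym set."""
    return a == b or b in SYNONYMS.get(a, set())


def collocations_similar(c1: Collocation, c2: Collocation) -> bool:
    """Treat two collocations as similar when their words match pairwise."""
    return all(words_match(a, b) for a, b in zip(c1, c2))
```

With this toy table, `("buy", "car")` and `("purchase", "automobile")` are treated as similar, whereas `("buy", "car")` and `("sell", "car")` are not, because only one of the two positions matches.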
This paper addresses the problem of searching for similar
collocations in English texts in order to determine
semantically connected text fragments for the analysis of
social network data streams.
II. RELATED WORK
Currently, there are a few approaches to extracting
semantically similar collocations from texts. To
determine the semantic similarity of collocations, they mainly
use statistical laws, (recurrent) neural networks (e.g. LSTM
networks that encode patterns of collocations as vector
representations) [6], or syntactic characteristics of
collocations.
For instance, in [7], English synonymous
collocation pairs are extracted using translation information.
This method obtains candidate synonymous collocation
pairs from a monolingual corpus and a thesaurus, and
then selects the appropriate pairs from the candidates using
their translations in a second language. Another method [8]
collects sets of words and paraphrases via pairwise alignment
of sentence fragments. Reference [9] presents a corpus-based
method for automatic extraction of paraphrases using
multiple English translations of the same source text.
Generally, all of these studies work on texts of certain
domains and take semantic information from thesauri that
Lviv Polytechnic National University Institutional Repository http://ena.lp.edu.ua