International Journal of Innovative Technology and Exploring Engineering (IJITEE)
ISSN: 2278-3075, Volume-8, Issue-9S, July 2019
Published By:
Blue Eyes Intelligence Engineering
& Sciences Publication
Retrieval Number: I10660789S19/19©BEIESP
DOI: 10.35940/ijitee.I1066.0789S19
An Empirical Analysis to Identify the Effect of
Indexing on Influence Detection using Graph
Databases
Mitali Desai, Rupa G. Mehta, Dipti P. Rana
Abstract: The data generated on social media platforms such as
Twitter, Facebook, LinkedIn etc. are highly connected. Such data
can be efficiently stored and analyzed using graph databases due
to the inherent property of graphs to model connected data. To
reduce the time complexity of data retrieval from huge graph
databases, various indexing techniques are used. This paper
presents an extensive empirical analysis of popular graph
databases, i.e. Neo4j, ArangoDB and OrientDB, with the aim of
measuring the competence and effectiveness of primitive indexing
techniques on query response time when identifying influencing
entities from Twitter data. The analysis demonstrates that Neo4j
performs efficiently and stably for load, relation and property
queries compared to the other two databases, whereas the performance
of OrientDB can be improved using primitive indexing.
Index Terms: graph database, index, influence detection, query
processing, Twitter
I. INTRODUCTION
Social media has emerged as a global digital platform for
people to communicate, socialize, launch discussion forums,
conduct surveys and share their personal information. Apart
from such activities, people use social media to express their
opinions about forthcoming events and/or give feedback
about past events. Such data are highly impactful in future
predictions and decision making in various sectors such as
education, business intelligence, marketing, government
bodies, commercial organizations and politics [1]. Each user
has the potential to spread their opinions (positive/negative)
across the social network, but some users demonstrate
dominance in opinion propagation. Identifying these
dominant (influential) users who can influence the opinions of
other users by word of mouth, viral promotions or specific
actions is helpful for marketing, advertising or spreading
influence.
Social media data are highly connected. Graph databases
are well suited to the highly connected data produced on
social media platforms such as Twitter, Facebook, Google
and LinkedIn [2]. Property queries, executed on node
properties in graph databases, play a significant role in
influence detection, e.g. identifying influencing entities in
various domains of social media data, analyzing the influence
of food habits on communicable/non-communicable diseases,
and recognizing influencing customers for customer retention
analysis.
Revised Manuscript Received on 30 June, 2019.
Mitali Desai, Rupa G. Mehta, Dipti P. Rana, Computer Engineering
Department, Sardar Vallabhbhai National Institute of Technology (SVNIT),
Surat, India.
Query responsiveness is a crucial parameter for such highly
interactive applications. Traditional sequential search
techniques implemented in graph databases demonstrate poor
performance due to the high computational complexity of
pair-wise comparisons, making them unsuitable for real-time
applications [3]–[4]. As the connected and high-volume data
in graph databases demand an efficient query processing
mechanism [5], various indexing techniques are implemented
in graph databases [6]–[11].
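The contrast between sequential search and an index lookup on a node property can be illustrated with a minimal Python sketch. It is not the indexing implementation of Neo4j, ArangoDB or OrientDB; the node records, the property names (`author`, `region`) and the helper functions are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical node records: each node is a dict of property key/value pairs.
nodes = [
    {"id": 1, "author": "alice", "region": "US"},
    {"id": 2, "author": "bob", "region": "UK"},
    {"id": 3, "author": "carol", "region": "US"},
]


def scan_lookup(nodes, key, value):
    """Sequential search: compare the property of every node with the value."""
    return [n["id"] for n in nodes if n.get(key) == value]


def build_index(nodes, key):
    """Build a simple hash index from property value to matching node ids."""
    index = defaultdict(list)
    for n in nodes:
        if key in n:
            index[n[key]].append(n["id"])
    return index


def index_lookup(index, value):
    """Indexed search: one dictionary probe instead of a scan over all nodes."""
    return index.get(value, [])


# The index is built once and then reused for every property query.
region_index = build_index(nodes, "region")
us_nodes_by_scan = scan_lookup(nodes, "region", "US")
us_nodes_by_index = index_lookup(region_index, "US")
```

Both lookups return the same node ids. The scan touches every node on each query, whereas the index pays a one-time build cost and must be kept up to date when the data change.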
In the literature, extensive comparisons of various graph
databases such as Neo4j, ArangoDB, OrientDB, MarkLogic,
AllegroGraph, Flock, Titan, GraphDB and HyperGraphDB
[12]–[21] have been performed with respect to several aspects of
data, storage and query. To the best of our knowledge, no
extensive analysis has been done on graph databases with the aim of
analyzing the effect of indexing on query response time. The
objective of this research is to present an extensive analysis of the
primitive indexing techniques of recent graph databases.
Among recent graph databases, Neo4j, ArangoDB and
OrientDB are widely used, as they possess prominent
characteristics such as application diversity, ACID properties,
an interactive GUI, a simple query language, provision for
property graph models and extensive primitive indexing. This
research focuses on graph databases supporting a single core
graph and a simple data structure, i.e. Neo4j, ArangoDB and
OrientDB, for influence detection from the ‘Russian Troll
Tweets’ dataset.
II. TECHNICAL BACKGROUND
This section focuses on the storage structure and
indexing mechanisms of Neo4j, ArangoDB and OrientDB.
The related work and the motivation for this research are
also described in this section.
A. Neo4j
Neo4j has recently emerged as a leading graph
database [13]–[14]. Neo4j supports the property graph model
with nodes and relations, both of which have properties in the form
of attribute-value pairs [19]. It uses the Cypher Query
Language (CQL) for traversal and property queries [12], [21].
Initially, indexing in Neo4j required each
index operation to be performed explicitly and manually. To
eliminate the manual intervention and enhance the indexing
module, Neo4j introduced auto indexing, which automatically
incorporates any changes made to the data into the index using the
global property key. The global property key extracts the same
property from different nodes that have no semantic correlation,
which degrades query
response accuracy [12]. For
example, the index on