International Journal of Innovative Technology and Exploring Engineering (IJITEE)
ISSN: 2278-3075, Volume-8, Issue-9S, July 2019
Published By: Blue Eyes Intelligence Engineering & Sciences Publication
Retrieval Number: I10660789S19/19©BEIESP
DOI: 10.35940/ijitee.I1066.0789S19

An Empirical Analysis to Identify the Effect of Indexing on Influence Detection using Graph Databases

Mitali Desai, Rupa G. Mehta, Dipti P. Rana

Abstract: The data generated on social media platforms such as Twitter, Facebook and LinkedIn are highly connected. Such data can be efficiently stored and analyzed using graph databases because graphs inherently model connected data. To reduce the time complexity of data retrieval from huge graph databases, various indexing techniques are used. This paper presents an extensive empirical analysis of three popular graph databases, namely Neo4j, ArangoDB and OrientDB, with the aim of measuring the competence and effectiveness of primitive indexing techniques in terms of query response time when identifying influencing entities from Twitter data. The analysis demonstrates that Neo4j performs efficiently and stably for load, relation and property queries compared to the other two databases, whereas the performance of OrientDB can be improved using primitive indexing.

Index Terms: graph database, index, influence detection, query processing, Twitter

I. INTRODUCTION

Social media has emerged as a global digital platform for people to communicate, socialize, launch discussion forums, conduct surveys and share their personal information. Apart from such activities, people use social media to express their opinions about forthcoming events and to give feedback about past events. Such data are highly impactful for future predictions and decision making in various sectors such as education, business intelligence, marketing, government bodies, commercial organizations and politics [1].
Each user has the potential to spread their opinions (positive or negative) across the social network, but some users dominate opinion propagation. Identifying these dominant (influential) users, who can influence the opinions of other users by word of mouth, viral promotions or specific actions, is helpful for marketing, advertising or spreading influence.

Social media data are highly connected. Graph databases are essential for the highly connected data produced on social media platforms such as Twitter, Facebook, Google and LinkedIn [2]. Property queries, which are executed on node properties in graph databases, play a significant role in influence detection, e.g. identifying influencing entities in various domains of social media data, analyzing the influence of food habits on communicable/non-communicable diseases, and recognizing influencing customers for customer retention analysis.

Revised Manuscript Received on 30 June, 2019.
Mitali Desai, Rupa G. Mehta, Dipti P. Rana, Computer Engineering Department, Sardar Vallabhbhai National Institute of Technology (SVNIT), Surat, India.

Query responsiveness is a crucial parameter for such highly interactive applications. Traditional sequential search techniques implemented in graph databases perform poorly because of the high computational complexity of pair-wise comparisons, which makes them unsuitable for real-time applications [3]–[4]. As the connected and high-volume data in graph databases demand an efficient query processing mechanism [5], various indexing techniques are implemented in graph databases [6]–[11]. In the literature, extensive comparisons of various graph databases such as Neo4j, ArangoDB, OrientDB, MarkLogic, AllegroGraph, Flock, Titan, GraphDB and HyperGraphDB [12]–[21] have been performed with respect to several aspects of data, storage and query. To the best of our knowledge, no extensive analysis has been done on graph databases with the aim of analyzing the effect of indexing on query response time.
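The contrast between sequential search and index-based lookup for a property query can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the toy nodes, the property names (`author`, `followers`) and the helper functions are hypothetical, and the hash map stands in for an index in general rather than for the index structures actually used by Neo4j, ArangoDB or OrientDB.

```python
from collections import defaultdict

# Hypothetical toy nodes; property names are illustrative assumptions only.
nodes = [
    {"id": 1, "author": "alice", "followers": 120},
    {"id": 2, "author": "bob", "followers": 4500},
    {"id": 3, "author": "alice", "followers": 120},
    {"id": 4, "author": "carol", "followers": 98000},
]


def scan_lookup(nodes, key, value):
    """Sequential search: compare the property of every node with the value."""
    return [n["id"] for n in nodes if n.get(key) == value]


def build_index(nodes, key):
    """Build a hash index mapping each property value to the matching node ids."""
    index = defaultdict(list)
    for n in nodes:
        if key in n:
            index[n[key]].append(n["id"])
    return index


def index_lookup(index, value):
    """Index lookup: a single hash probe instead of a pass over all nodes."""
    return index.get(value, [])


# The index is built once; every later property query reuses it.
author_index = build_index(nodes, "author")
scan_result = scan_lookup(nodes, "author", "alice")
index_result = index_lookup(author_index, "alice")
```

Both lookups return the same node ids. The sequential version touches every node on each query, whereas the indexed version pays the construction cost once and then answers each query with one probe, which is the trade-off that motivates measuring query response time with and without indexing.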
The objective of this research is to present an extensive analysis of the primitive indexing techniques of recent graph databases. Among recent graph databases, Neo4j, ArangoDB and OrientDB are widely used because they offer prominent characteristics such as application diversity, ACID properties, an interactive GUI, a simple query language, support for property graph models and extensive primitive indexing. This research focuses on graph databases supporting single core graph and simple data structure, i.e. Neo4j, ArangoDB and OrientDB, for influence detection from the ‘Russian Troll Tweets’ dataset.

II. TECHNICAL BACKGROUND

This section focuses on the storage structure and indexing mechanisms of Neo4j, ArangoDB and OrientDB. The related work and the motivation for this research are also described in this section.

A. Neo4j

Neo4j has recently emerged as a leading graph database [13]–[14]. Neo4j supports the property graph model with nodes and relations, both of which have properties in the form of attribute-value pairs [19]. It uses the Cypher Query Language (CQL) for traversal and property queries [12], [21]. Initially, the indexing implemented in Neo4j required each index operation to be performed explicitly and manually. To eliminate the manual intervention and enhance the indexing module, Neo4j introduced auto indexing, which automatically incorporates any changes made to the data into the index using the global property key. The global property key picks up the same property from different nodes that have no semantic correlation, which degrades query response accuracy [12]. For example, the index on