© 2020 JETIR November 2020, Volume 7, Issue 11 www.jetir.org (ISSN-2349-5162)
JETIR2011265 Journal of Emerging Technologies and Innovative Research (JETIR) www.jetir.org 932
SENTIMENT ANALYSIS ON BANGLA YOUTUBE COMMENTS USING MACHINE LEARNING TECHNIQUES
1 VEERANKI LAKSHMI DURGA, 2 A. MARY SOWJANYA
1 M.Tech, 2 Assistant Professor
Dept of CSSE, Andhra University College Of Engineering (A),
Visakhapatnam, AP, India.
Abstract: Sentiment Analysis (SA), also called opinion mining, is the study of people's opinions, sentiments, evaluations and appraisals towards
societal entities such as products, services, individuals, organizations and events. Most recent research on SA in natural language
processing (NLP) focuses on the English language. The Bangla language, by contrast, lacks a dataset that is both large
and standard. As a result, recent SA research on Bangla has fallen short of producing results that are both comparable to
work done in other languages and reusable for further research. In this work, a substantial textual dataset of both Bangla
and Romanized Bangla texts is provided; it is the first of its kind, and it is post-processed, multiply validated, and ready for SA
implementation and experiments. Further, in this project we scrape video information from YouTube and validate the data samples into one
of three categories: positive (1), negative (0) and neutral. This work uses real-time analytics, which simply means that data is analyzed
right after it becomes available, so insights can be produced without delay.
Keywords: Web-scraping; Bangla language; Romanized Bangla; Sentiment Analysis; TextBlob
I. INTRODUCTION
Bangla is spoken as the first language by almost 200 million people worldwide, 160 million of whom are Bangladeshi [1].
Bangladeshi people are increasingly involved in online activities such as connecting with friends and family through
social media, expressing their opinions and thoughts on popular micro-blogging and social networking sites, commenting
on online news portals, and shopping through online marketplaces and similar applications.
However, it is becoming increasingly hard for online businesses to monitor and analyze market trends, especially by
analyzing customers' reactions to their products or services, because such businesses involve little or no human-to-human interaction.
Moreover, going through the comments and reviews of each individual customer and figuring out the sentiments within is
tedious and in some cases simply intractable, especially since a very high volume of data is generated very quickly in
this age of digital connectivity. Therefore, automated Sentiment Analysis (SA) can play a vital role here in
enhancing efficiency and productivity. SA is widely employed as a machine
learning application in many areas, and is known by many other terms e.g. opinion extraction, sentiment mining, opinion mining,
subjectivity analysis, emotion analysis, review mining, etc. Most of the research works found on SA are based on the English
language, while Bangla SA is still at a formative stage. An interesting work by Das and Bandyopadhyay [2] on subjectivity
detection included Bangla but it is not self-sufficient, as English is also needed. However, none of the works truly considered
Bangladesh's perspective. We need to consider not just standardized Bangla, but also Banglish (Bangla words mixed with English
words) and Romanized Bangla. Each of these three major types can in turn be loosely categorized as good, standard, bad, wrong, totally
wrong, particular to a specific location (almost arcane), etc., depending on the level of clarity, grammatical correctness,
meaningfulness, personal idiosyncrasies, impact of localization, etc. Moreover, Romanized Bangla carries the added complexity of
variation in transliteration between people who know English well and those who do not [3]. The fact that no clear
standard is followed when 160 million Bangladeshi people write in any of these types makes the task all the more complicated
and challenging.
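As a rough illustration of how the text types described above might be separated before analysis, the following minimal sketch labels a comment by the script it is written in. It is not the method used in this work: the function name script_type and its labels are hypothetical, and the only fixed fact it relies on is that Bangla characters occupy the Unicode block U+0980 to U+09FF.

```python
import re

# Bangla (Bengali) script occupies the Unicode block U+0980 to U+09FF.
BANGLA_CHARS = re.compile(r"[\u0980-\u09FF]")
# Latin letters cover Romanized Bangla as well as plain English.
LATIN_CHARS = re.compile(r"[A-Za-z]")


def script_type(comment: str) -> str:
    """Return a coarse script label for a comment (hypothetical helper)."""
    has_bangla = bool(BANGLA_CHARS.search(comment))
    has_latin = bool(LATIN_CHARS.search(comment))
    if has_bangla and has_latin:
        return "mixed"   # e.g. Bangla script combined with Latin-script words
    if has_bangla:
        return "bangla"  # Bangla script only
    if has_latin:
        return "latin"   # Romanized Bangla or English; not distinguished here
    return "other"       # digits, punctuation, emoji, or empty text


if __name__ == "__main__":
    for sample in ["আমি ভালো আছি", "ami bhalo achi", "ভিডিওটা nice"]:
        print(sample, "->", script_type(sample))
```

A script check of this kind cannot tell Romanized Bangla from English, nor can it grade the quality of the writing; both would require a lexicon or a trained model.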
In the recent past, deep learning methods, specifically deep recurrent models, have enjoyed a lot of
success in Natural Language Processing (NLP) compared to more traditional machine learning methods [4]. While there are
other approaches to SA, in this research we concentrate exclusively on deep learning based techniques. Our key contributions
are:
• Web-scraping of YouTube Bangla and Romanized Bangla text samples, where each sample was annotated by two adult Bangla speakers.
• Pre-processing the data so that it is readily usable by researchers.
• Application of deep recurrent models on the Bangla and Romanized Bangla text corpus.
• Pre-training on the dataset of one label for another (and vice versa) to see whether it gives better results.
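The first two contributions can be sketched as a small pre-processing step. The sketch below is illustrative only and rests on stated assumptions: the function names are hypothetical, the cleaning is limited to URL removal and whitespace normalization, and samples on which the two annotators disagree are simply dropped. The paper does not say how disagreements were actually resolved.

```python
import re
from typing import List, Optional, Tuple

URL_PATTERN = re.compile(r"https?://\S+")
WHITESPACE = re.compile(r"\s+")


def clean_comment(text: str) -> str:
    """Remove URLs and collapse runs of whitespace (assumed cleaning steps)."""
    text = URL_PATTERN.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


def agreed_label(label_a: str, label_b: str) -> Optional[str]:
    """Return the label if both annotators agree, otherwise None."""
    return label_a if label_a == label_b else None


def build_corpus(raw_samples: List[Tuple[str, str, str]]) -> List[Tuple[str, str]]:
    """Keep non-empty comments whose two annotations agree.

    Each raw sample is (comment text, annotator A label, annotator B label).
    Dropping disagreements is an assumption of this sketch.
    """
    corpus = []
    for text, label_a, label_b in raw_samples:
        cleaned = clean_comment(text)
        label = agreed_label(label_a, label_b)
        if cleaned and label is not None:
            corpus.append((cleaned, label))
    return corpus


if __name__ == "__main__":
    samples = [
        ("khub  bhalo video https://example.com/x", "positive", "positive"),
        ("baje", "negative", "neutral"),
    ]
    print(build_corpus(samples))
```

Labels are kept as strings here because the source assigns numeric codes only to the positive (1) and negative (0) categories.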
The paper is organized as follows. Section 2 discusses the background of our work and the related work in the same
field that inspired and helped us. Section 3 describes in detail the dataset used for our experiments.
Section 4 presents the methodology and the experimental setup for the deep recurrent models. Section 5 discusses
the results of our experiments, and Section 6 concludes the article.