arXiv:1803.04000v1 [cs.CL] 11 Mar 2018

Preparing Bengali-English Code-Mixed Corpus for Sentiment Analysis of Indian Languages

Soumil Mandal¹, Sainik Kumar Mahata², Dipankar Das³
¹ Department of Computer Science & Engineering, SRM University, Chennai
²,³ Department of Computer Science & Engineering, Jadavpur University, Kolkata
{soumil.mandal, sainik.mahata, dipankar.dipnil2005}@gmail.com

Abstract

Analysis of the informative content and sentiments of social media users has been attempted quite intensively in the recent past. Most systems are usable only for monolingual data and fail or give poor results when used on code-mixed data. To draw attention to this problem and encourage researchers to work on it, we prepared gold standard Bengali-English code-mixed data with language and polarity tags for sentiment analysis purposes. In this paper, we discuss the systems we prepared to collect and filter raw Twitter data. To reduce manual work during annotation, hybrid systems combining rule based and supervised models were developed for both language and sentiment tagging. The final corpus was annotated by a group of annotators following a few guidelines. The resulting gold standard corpus shows strong inter-annotator agreement in terms of Kappa values. Various metrics such as the Code-Mixed Index (CMI) and Code-Mixed Factor (CF), along with various aspects (language and emotion), were also used to qualitatively assess the code-mixed and sentiment properties of the corpus.

Keywords: code-mixed, sentiment classification, language tagging, Twitter data, social media analysis

1. Introduction

India has a vast and linguistically diverse population due to its long history of contact with foreigners. English, one of the languages borrowed through this contact, became an integral part of the Indian education system and has been recognized as one of the official languages as well, thus giving rise to a population where bilingualism is very common.
This kind of language diversity, coupled with various dialects, leads to frequent code-mixing in India. The phenomenon has become even more visible with the rise of social networking sites like Twitter and Facebook and instant messaging services like WhatsApp. The writing style on such media typically involves phonetic typing, with native words transliterated into Roman script and generally mixed with English words through code-mixing and Anglicism. Three factors are involved in this sort of code-mixing: (1) lack of knowledge of the appropriate native words, (2) typing convenience and (3) the popularity of the Roman script, which caters to a larger audience.

Social networking services have been gaining popularity very rapidly since their first appearance and have led to an exponential growth of minable data which is rich and informative. In developing countries where the majority of the population is bilingual, we frequently observe a unique trend in social media data where two or more languages are mixed for expression, known as code-mixing. Such code-mixed data is growing rapidly on the web because multilingual users of social networks frequently share their sentiments, and thus it becomes an important task to mine and analyze such data to gather crucial information related to sentiment. However, the complexity involved in mixing multiple grammars and scripts, together with the use of transliteration, poses a big challenge for NLP tasks on such code-mixed data. Solving this problem is increasingly important, since a huge chunk of the data on social media has this property and would be of great use if mined.

Conventional methods devised for a single language inevitably fail or give poor results in such cases. Thus, to draw more attention from researchers to this important and challenging aspect, we developed code-mixed corpora for sentiment analysis in Indian languages.
India is a country with 255 million¹ multilingual speakers, and one of our goals in this work was to challenge participants and researchers to build advanced and robust systems for sentiment analysis of such code-mixed data. In the present article, we describe the systems and strategies used for making the Bengali-English code-mixed resources. Bengali is an Indo-Aryan language that is the first language of 8.10% of India's total population and is also the official language of Bangladesh. The original script in which Bengali is written by locals is the Eastern Nagari script². The majority of our collected data is from Twitter. The reasons why Twitter is an ideal source for collecting such data are explained by Pak and Paroubek (2010).

The contributions of our paper are as follows:

1. A method for collecting code-mixed data using filtering techniques to assure quality and reduce manual effort.
2. A fast and reliable language identification algorithm (accuracy = 81%) for code-mixed data with known target languages.
3. A sentiment classification system for code-mixed data using a hybrid system (accuracy = 80.97%) combining rule based and supervised models.
4. Gold standard Bengali-English code-mixed data with language and polarity tags.

¹ http://rajbhasha.nic.in/UI/pagecontent.aspx?pc=MzU=
² https://www.omniglot.com/writing/bengali.htm