(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 10, No. 8, 2019, www.ijacsa.thesai.org

Twitter Sentiment Analysis in Under-Resourced Languages using Byte-Level Recurrent Neural Model

Ridi Ferdiana 1, Wiliam Fajar 2, Desi Dwi Purwanti 3, Artmita Sekar Tri Ayu 4, Fahim Jatmiko 5
Department of Electrical Engineering and Information Engineering, Universitas Gadjah Mada, Yogyakarta, Indonesia 1, 2, 3, 4
Microsoft Innovation Center, Universitas Gadjah Mada, Yogyakarta, Indonesia 5

Abstract—Sentiment analysis in non-English languages can be more challenging than in English because publicly available resources for building high-accuracy prediction models are scarce. To alleviate this under-resourced problem, this article proposes using a byte-level recurrent neural model to generate text representations for Twitter sentiment analysis in the Indonesian language. Because the main part of the proposed model's training is unsupervised and does not require much labeled data, the approach can scale by using the large amount of unlabeled data that is easily gathered on the Internet, without much dependence on human-generated resources. This paper also introduces an Indonesian dataset for general sentiment analysis. It consists of 10,806 tweets selected from a total of 454,559 tweets gathered directly from Twitter using the Twitter API. The 10,806 tweets are classified into three categories: positive, negative, and neutral. This dataset could help the development of Indonesian sentiment analysis, especially general sentiment analysis, and encourage others to publish similar datasets in the future.

Keywords—Sentiment analysis; under-resourced problem; Indonesian dataset; twitter

I. INTRODUCTION
Sentiment analysis is the problem of systematically identifying and studying subjective information.
This is commonly translated into the task of polarity detection: given a piece of written text, the problem is to categorize it into a positive or negative class. The task can be expanded to an ordinal classification problem, which assigns the text a value (e.g., a number from -2 to +2) instead of only positive or negative. Some argue that polarity detection is not synonymous with sentiment analysis but is only one subtask of the sentiment analysis process [1], [2]. However, because this work focuses on that task, this article uses the terms sentiment analysis and polarity detection interchangeably.

Plenty of methods have been introduced in previous studies to deal with the sentiment analysis problem. In general, a method can be either supervised or unsupervised. A lexicon-based approach is often used in unsupervised cases, where a list of words with their sentiment scores is required to assign the overall sentiment of a document. On the other hand, supervised machine learning techniques can also be used to build a sentiment analysis system: because there is no exact mapping between character patterns in the text and the polarity of the sentiment (positive or negative), a model is produced from a set of data and the computer is left to learn the patterns. Several machine learning methods have been used for polarity detection: neural networks [3], [4], decision trees [5], support vector machines (SVM) [6], and naive Bayes [7]. Feature pre-processing and extraction are carried out before classification, which requires large computing power.

Both machine-learning and lexicon-based methods need extensive resources that are manually prepared. Lexicon-based methods need sentiment lexicons, while machine-learning-based methods need a lot of labeled data. These resources may be scarce for many languages, especially non-English languages such as Indonesian.
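
As a rough illustration of the lexicon-based approach mentioned above, the following minimal sketch sums per-word sentiment scores and maps the total to a polarity label. The lexicon entries, their scores, and the function name `lexicon_polarity` are hypothetical and chosen only for illustration; they are not taken from this paper or from any published Indonesian lexicon.

```python
# Toy sentiment lexicon. The words and scores are hypothetical examples,
# not entries from a real Indonesian sentiment lexicon.
TOY_LEXICON = {
    "bagus": 1.0,    # "good"
    "senang": 1.0,   # "happy"
    "buruk": -1.0,   # "bad"
    "kecewa": -1.0,  # "disappointed"
}


def lexicon_polarity(text, lexicon=TOY_LEXICON):
    """Sum the scores of known words and map the total to a polarity label."""
    # Naive whitespace tokenization; punctuation is not stripped here.
    tokens = text.lower().split()
    # Words missing from the lexicon contribute nothing to the score.
    score = sum(lexicon.get(token, 0.0) for token in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

With this toy lexicon, `lexicon_polarity("Filmnya bagus")` yields a positive label, while any text made up only of words outside the lexicon is labeled neutral. The sketch makes the resource dependence visible: the quality of the output is bounded by the coverage of the manually prepared word list.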
Human-generated resources are expensive because they require much time and manual labor. This motivates us to ease the problem in two ways: by adding a resource that may help other researchers conduct research in this area, and by proposing a sentiment analysis system that leverages an unsupervised approach, which minimizes the need for human-generated resources.

This paper proposes an unsupervised method for addressing the under-resourced problem in sentiment analysis for the Indonesian language. It presents a methodology that uses a byte-level self-supervised neural network to generate sentence representations for Indonesian sentiment analysis, under the hypothesis that combining this method with an existing popular technique such as TF-IDF will improve sentiment classification performance.

Our main contributions are as follows:
• The use of an unsupervised approach to minimize the under-resourced problem in the Indonesian language, particularly a byte-level recurrent neural model that generates sentence representations.
• A Twitter dataset that contains 10,806 labeled samples and 454,559 unlabeled samples, which we hope can serve as an evaluation benchmark when building sentiment analysis systems in the Indonesian language.

II. RELATED WORK
This section overviews existing research on sentiment analysis in general, with emphasis on the Indonesian language.

The work conducted in this paper is sponsored by Microsoft Rinna