© 2015 Ahmed Alsaffar and Nazlia Omar. This open access article is distributed under a Creative Commons Attribution
(CC-BY) 3.0 license.
Journal of Computer Science
Original Research Paper
Integrating a Lexicon Based Approach and K Nearest
Neighbour for Malay Sentiment Analysis
Ahmed Alsaffar and Nazlia Omar
Center for AI Technology, FTSM University Kebangsaan Malaysia, UKM 43000 Bangi Selangor, Malaysia
Article history
Received: 06-05-2015
Revised: 10-06-2015
Accepted: 16-06-2015
Corresponding Author:
Ahmed Alsaffar
Center for AI Technology,
FTSM University Kebangsaan
Malaysia, UKM 43000 Bangi
Selangor, Malaysia
Email: ahmed_saffar5@yahoo.com
Abstract: Sentiment analysis, or opinion mining, refers to the automatic
extraction of sentiments from natural language text. Although many
studies on sentiment analysis have been conducted, only a limited
number of them focus on sentiment analysis in the Malay
language. In this article, a new approach for automatic sentiment analysis of
Malay movie reviews is proposed, implemented and evaluated. In contrast
to most studies, which focus on supervised or unsupervised machine learning
approaches, this research proposes a new model for Malay sentiment
analysis based on a combination of both approaches. We use sentiment
lexicons in the new model to generate a new set of features to train a
k-Nearest Neighbour (k-NN) classifier. We further show that our hybrid
method outperforms the state-of-the-art unigram baseline.
Keywords: Malay Sentiment Analysis, Feature Extraction, Machine
Learning, Combination Techniques
Introduction
Opinions play a primary role in decision-making
processes. Whenever people need to make a
choice, they are naturally inclined to hear others’
opinions. In particular, when the decision involves
consuming valuable resources, such as time and/or
money, people rely strongly on their peers’ past
experiences. Customers can also
learn about the positive or negative aspects of different
features of products/services from users’
opinions in order to make an educated purchase. Furthermore,
applications such as rating movies based on online movie
reviews (Pang et al., 2002) could not have emerged without
making use of these data.
The topic of sentiment analysis has become
extremely popular in the last couple of years, and there
has been a tremendous amount of research on it.
The topic goes by several names, including
opinion mining and sentiment classification.
Generally, sentiment analysis is a special case of text
classification that aims to classify the sentiment of
subjective texts, usually customer reviews of some
product or service.
Organizations are looking for opportunities to
analyze the personal opinions gathered online
about their services and products in order to improve their
business outcomes. However, it is difficult to
classify the large volume of online user
information in a way that reflects the users’ opinions
accurately. Additionally, users express their
opinions in free text, i.e., in unstructured form,
which increases the difficulty of determining the
polarity of the opinions in these texts (Puteh et al., 2013).
The majority of studies are concerned with analyzing
users’ opinions expressed in English. There
has been a very limited amount of research that
focuses on sentiment analysis in the Malay language
(Samsudin et al., 2013).
The main goal of this work is to identify an
optimized set of features that enhances Malay
sentiment analysis and classification. We consider
bag-of-words (unigram) features as the baseline for
sentiment classification. We train a k-Nearest
Neighbour (k-NN) classifier on the unigram
feature set and compare it against our newly
proposed model, which combines lexicon knowledge
with a supervised machine learning approach for
Malay sentiment analysis and classification.
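To make the comparison concrete, the following is a minimal sketch of the two feature sets feeding the same k-NN classifier. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the toy lexicons, the toy training reviews, the cosine similarity measure, the value k = 3 and all function names are hypothetical choices made only for this example.

```python
from collections import Counter
import math

# Hypothetical toy lexicons (illustrative only, not the paper's resources).
POSITIVE = {"bagus", "hebat", "menarik"}
NEGATIVE = {"teruk", "bosan", "lemah"}

def unigram_features(text):
    """Baseline: bag-of-words mapping each lowercase token to its count."""
    return Counter(text.lower().split())

def lexicon_features(text):
    """Lexicon-derived features: counts of positive and negative hits."""
    tokens = text.lower().split()
    return Counter({
        "pos_hits": sum(t in POSITIVE for t in tokens),
        "neg_hits": sum(t in NEGATIVE for t in tokens),
    })

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(key, 0) for key, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_predict(train, text, featurize, k=3):
    """Label `text` by majority vote among its k most similar training reviews."""
    query = featurize(text)
    ranked = sorted(train,
                    key=lambda example: cosine(featurize(example[0]), query),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy training set of labelled Malay movie reviews.
TRAIN = [
    ("filem ini bagus dan menarik", "positive"),
    ("lakonan hebat cerita menarik", "positive"),
    ("pelakon hebat filem bagus", "positive"),
    ("filem ini teruk dan bosan", "negative"),
    ("cerita lemah lakonan teruk", "negative"),
    ("jalan cerita bosan dan lemah", "negative"),
]

if __name__ == "__main__":
    review = "sangat teruk dan bosan"
    print(knn_predict(TRAIN, review, unigram_features))
    print(knn_predict(TRAIN, review, lexicon_features))
```

The only difference between the two calls is the `featurize` argument, which mirrors the comparison described above: the classifier is held fixed while the feature set changes.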
There are multiple approaches to sentiment analysis
(SA), which may be separated into three main categories.
Firstly, the supervised machine learning approach has
been implemented in numerous studies (Balahur et al.,
2014; Pang et al., 2002; Greaves et al., 2012; Kang et al.,
2012; Turney, 2002). Secondly, the unsupervised machine
learning approach is also a popular technique for
sentiment analysis (Gezici et al., 2013).