978-1-4799-3516-1/13/$31.00 © 2013 53
Detecting Deceptive Reviews Using Lexical and Syntactic Features
Somayeh Shojaee*, Masrah Azrifah Azmi Murad†, Azreen Bin Azman‡, Nurfadhlina Mohd Sharef§ and Samaneh Nadali¶
Faculty of Computer Science and Information Technology
Universiti Putra Malaysia
Selangor, Malaysia
Email: somayeh.shojaee@gmail.com*, {masrah†, azreenazman‡, nurfadhlina§}@upm.edu.my, sm.nadeali@gmail.com¶
Abstract—Deceptive opinion classification has attracted a lot
of research interest due to the rapid growth of social media
users. Despite the availability of a vast number of opinion
features and classification techniques, review classification
remains a challenging task. In this work we applied stylometric
features, i.e., lexical and syntactic features, with supervised machine
learning classifiers, i.e., Support Vector Machine (SVM) with
Sequential Minimal Optimization (SMO) and Naive Bayes,
to detect deceptive opinions. Detecting deceptive opinions is
difficult for a human reader because spammers try to
write carefully crafted reviews, which changes their writing style
and verbal usage. Hence, considering stylometric features
helps to distinguish the spammers’ writing style and find deceptive
reviews. Experiments on an existing hotel review corpus suggest
that using stylometric features is a promising approach for
detecting deceptive opinions.
Keywords-Deceptive; Opinion; Lexical; Syntactic; Classification
I. INTRODUCTION
Social media is a source of information about users’ behaviors
and interests. There are obvious benefits for different
parties, such as companies or governments, in understanding
what the public thinks about their products and services.
User opinions can affect product sales, changes in
government policy, etc. However, the widespread
sharing and use of user-contributed reviews have also
raised concerns about their reliability. Manually
detecting and examining fake reviews and reviewers is not
feasible because of information overload.
As the number of reviews increases, different kinds of
methods have been proposed to improve opinion mining tasks,
e.g., [3], [4], [5], [6], [7], [11], [15], [17] and [20]. Most
existing work on review spam detection has focused
on spotting review spam that a human can detect,
but detecting deceptive reviews is now the concern of
companies, governments and researchers. Such
reviews are not easily identified by a human reader. For
instance, one of the following two reviews is spam and the
other is ham (the reviews are taken from the corpus
[9]).
• Spam: While traveling for business I had my family join
me and we stayed at the Fairmont Chicago, Millennium
Park. It is breathtakingly beautiful and plush. The rates
for rooms, though not the cheapest choice are quite
reasonable for such fabulous amenities. The room decor
is elegant modern and my wife says the spa is divine!
Guests of this hotel are able to live in luxury - every
room is equipped with complementary coffee, access to
high speed internet, bathrobes, and 42 inch flat screen
televisions. Enjoy the gorgeous city views and live the
good life at the Fairmont Chicago, Millennium Park.
• Ham: My husband and I decided to take a trip to
Chicago at the last minute and quickly chose the Con-
rad, not really knowing it’s location. We were pleas-
antly surprised to find it was almost right on Michigan
Ave. (It’s connect to Nordstroms right in the heart of
downtown. ) Great for shopping. The hotel itself was
fabulous. The staff was extremely helpful in giving
suggestions for shows and restaurants. I recommend
going to Joe’s that’s right across the street from the
hotel. Everyone was raving about it . . .but we couldn’t
get a reservation. The room was wonderful and the
bed and pillows were very comfortable. It had a very
warm and inviting atmosphere. My husband loved the
workout room. I highly recommend the Conrad!!
The authors of [3] used syntactic stylometry features for
deception detection, considering unigrams, bigrams, and
their combination as features. In another work [11],
the authors detected deceptive opinion spam with
machine learning techniques using lexico-syntactic patterns,
such as n-grams (unigrams, bigrams, trigrams) and part-of-speech
(POS) tags, together with features derived from LIWC
(Linguistic Inquiry and Word Count) output.
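To make the n-gram features mentioned above concrete, the following is a minimal sketch of how unigram and bigram counts could be extracted from a single review. It is an illustration only, not the feature extraction used in [3] or [11]. The function names (`tokenize`, `ngram_counts`, `unigram_bigram_features`) and the simple lowercase regular-expression tokenizer are assumptions introduced here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract word tokens.

    The regular expression is a simplifying assumption; a real
    system would likely use a proper tokenizer.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def ngram_counts(tokens, n):
    """Count contiguous n-grams (as tuples) in a list of tokens."""
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


def unigram_bigram_features(text):
    """Combine unigram and bigram counts into one feature mapping."""
    tokens = tokenize(text)
    features = Counter()
    features.update(ngram_counts(tokens, 1))
    features.update(ngram_counts(tokens, 2))
    return features


# Hypothetical example sentence, not taken from the corpus.
review = "The room was wonderful and the bed was comfortable"
features = unigram_bigram_features(review)
```

Each key is a tuple of one or two tokens, so unigrams and bigrams can share a single mapping without colliding. Such counts would typically be converted into a fixed-length vector over a vocabulary before being passed to a classifier.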
Using writing-style features to identify spammers has
attracted researchers’ attention. Spammers try to write carefully
crafted reviews and to hide their own writing style, which
changes their writing style and verbal usage. Several
linguistic cues help to distinguish truthful reviews from fake
reviews. For instance, spammers in comparison with real