978-1-4799-3516-1/13/$31.00 © 2013

Detecting Deceptive Reviews Using Lexical and Syntactic Features

Somayeh Shojaee*, Masrah Azrifah Azmi Murad†, Azreen Bin Azman‡, Nurfadhlina Mohd Sharef§ and Samaneh Nadali¶
Faculty of Computer Science and Information Technology
Universiti Putra Malaysia
Selangor, Malaysia
Email: somayeh.shojaee@gmail.com*, {masrah†, azreenazman‡, nurfadhlina§}@upm.edu.my, sm.nadeali@gmail.com¶

Abstract—Deceptive opinion classification has attracted a lot of research interest due to the rapid growth in the number of social media users. Despite the availability of a vast number of opinion features and classification techniques, review classification remains a challenging task. In this work we applied stylometric features, i.e., lexical and syntactic features, with supervised machine learning classifiers, i.e., Support Vector Machine (SVM) with Sequential Minimal Optimization (SMO) and Naive Bayes, to detect deceptive opinions. Deceptive opinions are difficult for a human reader to detect because spammers try to write convincing reviews, which in turn changes their writing style and word usage. Hence, considering stylometric features helps to distinguish the spammers’ writing style and to find deceptive reviews. Experiments on an existing hotel review corpus suggest that using stylometric features is a promising approach for detecting deceptive opinions.

Keywords-Deceptive; Opinion; Lexical; Syntactic; Classification

I. INTRODUCTION

Social media is a source of information about users’ behaviors and interests. There are obvious benefits for different parties, such as companies or governments, in understanding what the public thinks about their products and services. User opinions can affect product sales, changes in government policy, and so on. However, the widespread sharing and use of user-contributed reviews has also increased concerns about their reliability.
Manually detecting and examining fake reviews and reviewers is not workable because of information overload. As the number of reviews increases, different kinds of methods have been proposed to improve opinion mining tasks, e.g., [3], [4], [5], [6], [7], [11], [15], [17] and [20]. Most existing work on review spam detection has focused on review spam that a human can detect, but detecting deceptive reviews is now a concern of companies, governments and researchers. Such reviews are not easily identified by a human reader. For instance, of the following two reviews, one is spam and the other is ham (the reviews are taken from the corpus [9]).

• Spam: While traveling for business I had my family join me and we stayed at the Fairmont Chicago, Millennium Park. It is breathtakingly beautiful and plush. The rates for rooms, though not the cheapest choice are quite reasonable for such fabulous amenities. The room decor is elegant modern and my wife says the spa is divine! Guests of this hotel are able to live in luxury - every room is equipped with complementary coffee, access to high speed internet, bathrobes, and 42 inch flat screen televisions. Enjoy the gorgeous city views and live the good life at the Fairmont Chicago, Millennium Park.

• Ham: My husband and I decided to take a trip to Chicago at the last minute and quickly chose the Conrad, not really knowing it’s location. We were pleasantly surprised to find it was almost right on Michigan Ave. (It’s connect to Nordstroms right in the heart of downtown.) Great for shopping. The hotel itself was fabulous. The staff was extremely helpful in giving suggestions for shows and restaurants. I recommend going to Joe’s that’s right across the street from the hotel. Everyone was raving about it...but we couldn’t get a reservation. The room was wonderful and the bed and pillows were very comfortable. It had a very warm and inviting atmosphere.
My husband loved the workout room. I highly recommend the Conrad!!

The authors in [3] used syntactic stylometry features for deception detection, considering unigrams, bigrams, and their combination as features. In another work [11], the authors detected deceptive opinion spam with machine learning techniques using lexico-syntactic patterns, such as n-grams (unigrams, bigrams, trigrams) and part-of-speech (POS) tags, together with features derived from the LIWC (Linguistic Inquiry and Word Count) output.

Using writing-style features to identify spammers has attracted researchers’ attention. Spammers try to write convincing reviews and to hide their writing style, which in turn changes their writing style and word usage. To distinguish truthful reviews from fake reviews, there are several linguistic hints. For instance, spammer in comparison with real