Building layered, multilingual sentiment lexicons at synset and lemma levels

Fermín L. Cruz, José A. Troyano, Beatriz Pontes, F. Javier Ortega

Department of Languages and Computer Systems, University of Seville, Av. Reina Mercedes s/n, 41012 Sevilla, Spain

Keywords: Sentiment analysis; Multilingual sentiment lexicons; Spanish resources for sentiment analysis

Abstract

Many tasks related to sentiment analysis rely on sentiment lexicons, lexical resources containing information about the emotional implications of words (e.g., the sentiment orientation of words, positive or negative). In this work, we present an automatic method for building lemma-level sentiment lexicons, which has been applied to obtain lexicons for English, Spanish and three other official languages of Spain. Our lexicons are multi-layered, allowing applications to trade off the number of available words against the accuracy of the estimations. Our evaluations show high accuracy values in all cases. As a preliminary step towards the lemma-level lexicons, we have built a synset-level lexicon for English similar to SentiWordNet 3.0, one of the most widely used sentiment lexicons today. We have made several improvements to the original SentiWordNet 3.0 building method, which yield significantly better estimates of positivity and negativity according to our evaluations. The resource containing all the lexicons, ML-SentiCon, is publicly available.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

Sentiment analysis is a modern subdiscipline of Natural Language Processing which deals with subjectivity, affect and opinions in texts (good surveys of the subject can be found in Pang & Lee (2008) and Liu & Zhang (2012)). It is a very active research area, since the opinions expressed by users on the Internet constitute very valuable information for governments, companies and consumers, and their large volume and high rate of appearance require automated analysis methods.
Detecting subjectivity, classifying texts by the overall sentiment expressed (positive vs. negative), and extracting individual opinions and their participants are three of the many tasks addressed. Some of these tasks rely on sentiment lexicons as a component of the solution.

A sentiment lexicon is a lexical resource containing information about the emotional implications of words. Commonly, this information refers to the prior polarity of words, i.e. their positive or negative nature regardless of context. For example, the word “good” has a positive prior polarity, although it may be used in a negative sentence (“His second album is not so good”).

In this paper we present new sentiment lexicons for English, Spanish and three other official languages of Spain. The lexicons are multi-layered, allowing applications to trade off the number of available words against the accuracy of the estimations of their prior polarities. As a preliminary step, we have reproduced the method proposed by Baccianella, Esuli, and Sebastiani (2010) to build SentiWordNet 3.0, one of the most widely used sentiment lexicons today. We have introduced several improvements to the original method, which improve the accuracy of the resulting resource according to our evaluations.

We believe that the resource containing all the lexicons, ML-SentiCon, can be useful in many sentiment applications for both English and Spanish. The automatic method proposed here could also be reproduced for new languages, provided that WordNet versions for those languages are available. This is advantageous in that it allows sentiment lexicons to be obtained quickly for languages that lack such resources. However, it should also be noted that any lexicon constructed by automatic or semi-automatic methods must be used with caution, as it will inevitably contain errors (words incorrectly labelled as positive or negative).
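The notions of prior polarity and of a layered lexicon described above can be illustrated with a minimal sketch. It is not the ML-SentiCon file format or API: the `Entry` type, the helper functions, the polarity range of [-1, 1], the reading of layers as cumulative (a higher layer adds words to the lower ones), and every entry and score below are assumptions invented purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Entry:
    """One hypothetical lexicon entry (not the real ML-SentiCon schema)."""
    lemma: str
    polarity: float  # assumed prior polarity in [-1, 1]
    layer: int       # assumed: 1 is the most reliable layer


# Toy entries with made-up scores and layers, for illustration only.
TOY_LEXICON: List[Entry] = [
    Entry("good", 0.8, 1),
    Entry("terrible", -0.9, 1),
    Entry("decent", 0.4, 3),
    Entry("bland", -0.3, 6),
]


def build_lookup(entries: List[Entry], max_layer: int) -> Dict[str, float]:
    """Keep only the entries whose layer is at most max_layer."""
    return {e.lemma: e.polarity for e in entries if e.layer <= max_layer}


def prior_polarity(lookup: Dict[str, float], lemma: str) -> Optional[float]:
    """Return the context-independent polarity, or None if the lemma is unknown."""
    return lookup.get(lemma)


# A small max_layer favours accuracy; a large one favours coverage.
strict = build_lookup(TOY_LEXICON, max_layer=1)
broad = build_lookup(TOY_LEXICON, max_layer=6)

print(len(strict), len(broad))
print(prior_polarity(strict, "bland"), prior_polarity(broad, "bland"))

# Prior polarity ignores context: "good" stays positive even in
# "His second album is not so good"; handling the negation is left
# to the application that consumes the lexicon.
print(prior_polarity(strict, "good"))
```

In this sketch, an application that prefers precision would build its lookup from the lowest layers only, while one that needs broader coverage would raise `max_layer` and accept noisier estimates.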
Since automatically built lexicons can contain mislabelled words, it is good practice to have them reviewed by native speakers. In the case of ML-SentiCon, layers 1–4 have been completely reviewed. Although the remaining layers have not been reviewed, evaluations based on statistically representative random samples indicate a tolerable error rate up to layer 7 (see Section 4.3 for details).

Corresponding author: F.L. Cruz. Address: Escuela Técnica Superior de Ingeniería Informática, Av. Reina Mercedes s/n, 41012 Sevilla, Spain. Tel.: +34 954 55 62 33. E-mail address: fcruz@us.es.
Expert Systems with Applications 41 (2014) 5984–5994. http://dx.doi.org/10.1016/j.eswa.2014.04.005
0957-4174/© 2014 Elsevier Ltd. All rights reserved.

The structure of the paper is as follows. In Section 2, we review some related work on sentiment lexicons, including a description