Learning Sentiment Lexicons in Spanish

Verónica Pérez-Rosas, Carmen Banea, Rada Mihalcea
Department of Computer Science
University of North Texas
{veronica.perezrosas, carmen.banea}@gmail.com, rada@cs.unt.edu

Abstract

In this paper we present a framework to derive sentiment lexicons in a target language by using manually or automatically annotated data available in a language rich in electronic resources, such as English. We show that bridging the language gap using the multilingual sense-level aligned WordNet structure allows us to generate a high accuracy (90%) polarity lexicon comprising 1,347 entries, and a disjoint lower accuracy (74%) one encompassing 2,496 words. By using an LSA-based vectorial expansion for the generated lexicons, we are able to obtain an average F-measure of 66% in the target language. This implies that the lexicons could be used to bootstrap higher-coverage lexicons using in-language resources.

Keywords: multilingual natural language processing, multilingual subjectivity and sentiment analysis, lexicon generation

1. Introduction

Subjectivity and sentiment analysis focuses on the automatic identification of private states, such as opinions, emotions, sentiments, evaluations, beliefs, and speculations in natural language. While subjectivity classification labels text as either subjective or objective, sentiment classification adds a level of granularity by further classifying subjective text as positive, negative, or neutral. A large number of text processing applications have already used techniques for automatic sentiment and subjectivity analysis, including expressive text-to-speech synthesis (Alm et al., 2005), tracking sentiment timelines in on-line forums and news (Lloyd et al., 2005; Balog et al., 2006), analysis of political debates (Thomas et al., 2006; Carvalho et al., 2011), question answering (Yu and Hatzivassiloglou, 2003), and conversation summarization (Carenini et al., 2008).
Much of the research work to date on sentiment and subjectivity analysis has been applied to English, but work on other languages is growing, including Japanese (Kobayashi et al., 2004; Suzuki et al., 2006; Takamura et al., 2006; Kanayama and Nasukawa, 2006), Chinese (Hu et al., 2005; Tsou et al., 2005; Zagibalov and Carroll, 2008), German (Kim and Hovy, 2006), and Romanian (Mihalcea et al., 2007; Banea et al., 2008b). In addition, several participants in the Chinese and Japanese Opinion Extraction tasks of NTCIR-6 (Kando et al., 2008) performed subjectivity and sentiment analysis in languages other than English.

As only 27% of Internet users speak English,¹ the construction of resources and tools for subjectivity and sentiment analysis in languages other than English is a growing need. In this paper, we propose a new method to build a subjectivity and sentiment lexicon for Spanish, which we later employ to perform sentence-level sentiment classification and seek to enrich through a bootstrapping process in the target language.

¹ www.internetworldstats.com/stats.htm, Oct 11, 2011

2. Related Work

Lexicons have been widely used for sentiment and subjectivity analysis, as they represent a simple yet effective way to build rule-based opinion classifiers. For instance, one of the most frequently used lexicons is the subjectivity and sentiment lexicon provided with the OpinionFinder distribution (Wiebe and Riloff, 2005). The lexicon was compiled from manually developed resources augmented with entries learned from corpora, and it contains 6,856 unique entries that are also associated with a polarity label, indicating whether the corresponding word or phrase is positive, negative, or neutral.
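As a rough illustration of how a polarity lexicon can drive a rule-based classifier, the following minimal sketch labels a sentence by counting positive and negative lexicon hits. The toy lexicon, the function name `classify`, and the majority-vote rule are assumptions made for illustration only; they are not taken from OpinionFinder or from the method proposed in this paper.

```python
from collections import Counter

# Hypothetical toy lexicon: entries and labels are illustrative only
# and are not drawn from any published resource.
TOY_LEXICON = {
    "good": "positive",
    "excellent": "positive",
    "bad": "negative",
    "terrible": "negative",
    "report": "neutral",
}


def classify(sentence, lexicon=TOY_LEXICON):
    """Label a sentence by majority vote over its polar lexicon entries."""
    # Lowercase, split on whitespace, and strip common punctuation.
    tokens = [t.strip(".,;:!?") for t in sentence.lower().split()]
    votes = Counter(lexicon[t] for t in tokens if t in lexicon)
    pos, neg = votes["positive"], votes["negative"]
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    # Ties and sentences with no polar words default to neutral.
    return "neutral"


if __name__ == "__main__":
    print(classify("The food was excellent and the service was good."))
```

A practical rule-based classifier would typically also handle negation, multi-word entries, and word sense; the sketch omits all of these.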
SentiWordNet (Esuli and Sebastiani, 2006) is a resource for opinion mining built on top of WordNet, which assigns each synset in WordNet a score triplet (positive, negative, and objective), indicating the strength of each of these three properties for the words in the synset. The SentiWordNet annotations encompass more than 100,000 words and were automatically generated, starting with a small set of manually labeled synsets.

While there are several English lexicons for sentiment and subjectivity analysis, we are aware of only a very small number of such lexicons available for other languages. Abdul-Mageed et al. (2011) manually compiled a list of approximately 4,000 Arabic adjectives from the newswire domain annotated for polarity. Clematide and Klenner (2010) extract a list of 8,000 nouns, verbs, and adjectives in German annotated for polarity and strength. Most efforts to date, though, have focused on automatic procedures of lexicon construction, such as Kaji and Kitsuregawa (2007) for Japanese, Lu et al. (2010) and Xu et al. (2010) for Chinese, or Banea et al. (2008a) for Romanian.

The work closest to ours is that of Rao and Ravichandran (2009), who introduce a lexicon induction method that uses the WordNet graph and the relationships it entails to extend polarity classification to other words using graph-based semi-supervised learning algorithms, such as mincuts, randomized mincuts, and label propagation. The latter method is the best performing one and was applied to Hindi (employ-