YACIS: A Five-Billion-Word Corpus of Japanese Blogs Fully Annotated with Syntactic and Affective Information

Michal Ptaszynski 1, Pawel Dybala 2, Rafal Rzepka 3, Kenji Araki 4 and Yoshio Momouchi 5

Abstract. This paper presents YACIS, a new, fully annotated large-scale corpus of the Japanese language. The corpus is based on blog entries from the Ameba blog service. The original structure (blog post and comments) is preserved, so the semantic relations between posts and comments are maintained. The corpus is annotated with syntactic (POS, dependency parsing, etc.) and affective (emotive expressions, emoticons, valence, etc.) information. The annotations are evaluated in a survey of over forty respondents. The corpus is also compared to other existing corpora, both large-scale and emotion-related.

1 INTRODUCTION

Text corpora are some of the most vital linguistic resources in natural language processing (NLP). These include newspaper corpora [1], conversation corpora, and corpora of literature 6. Unfortunately, compared to major world languages such as English, few large corpora are available for the Japanese language. Moreover, the great majority of them are based on newspapers or legal documents 7. These are usually unsuitable for research on sentiment analysis and emotion processing, as emotions and attitudes are rarely expressed in such texts. Although there exist conversation corpora with speech recordings, which could be useful in such research 8, they are relatively small because such corpora are difficult to compile.

Recently, blogs have been recognized as a rich source of publicly available text. Blogs are open diaries in which people record their own experiences, opinions and feelings, to be read and commented on by other people. Because of their richness in subjective and evaluative information, blogs have come into focus in sentiment and affect analysis [2, 3, 4, 5].
Therefore, creating a large blog-based emotion corpus could solve both problems: the shortage of large corpora and the limited applicability of existing ones to research on sentiment analysis and emotion processing. However, only a few small (several thousand sentences) Japanese emotion corpora have been developed so far [2]. Although there exist medium-scale Web-based corpora (containing several million words), such as JpWaC [6] or jBlogs [7], access to them is usually allowed only through a Web interface, which makes additional annotation (parts of speech, dependency structure, deeper affective information, etc.) difficult. Furthermore, although there exist large resources, like the Google N-gram Corpus [8], the textual data in such resources are short (up to 7-grams) and do not contain any contextual information. This makes them unsuitable for emotion processing research, since most of the contextual information, which is so important in expressing emotions [9], is lost.

Therefore we decided to create a new corpus from scratch. The corpus was compiled using procedures similar to those developed in the WaCky initiative [10], but optimized for mining only one blog service (Ameba blog, http://ameblo.jp/, later referred to as Ameblo). The compiled corpus was fully annotated with syntactic (POS, lemmatization, dependency parsing, etc.) and affective information (emotive expressions, emotion classes, valence, etc.).

1 Hokkai-Gakuen University, Japan, email: ptaszynski@hgu.jp
2 Otaru University of Commerce, Japan, email: paweldybala@res.otaru-uc.ac.jp
3 Hokkaido University, Japan, email: kabura@media.eng.hokudai.ac.jp
4 Hokkaido University, Japan, email: araki@media.eng.hokudai.ac.jp
5 Hokkai-Gakuen University, Japan, email: momouchi@eli.hokkai-s-u.ac.jp
6 http://www.aozora.gr.jp/
7 http://www-nagao.kuee.kyoto-u.ac.jp/NLP Portal/lr-cat-e.html
8 http://www.ninjal.ac.jp/products-k/katsudo/seika/corpus/public/

The outline of the paper is as follows.
Section 2 describes related research on large-scale corpora and blog emotion corpora. Section 3 presents the procedures used in compiling the corpus. Section 4 describes the tools used in corpus annotation. Section 5 presents detailed statistical data and an evaluation of the annotations. Finally, the paper is concluded and applications of the corpus are discussed.

2 RELATED RESEARCH

In this section we present some of the research most relevant to ours. No billion-word-scale corpus annotated with affective information has existed before. Therefore we divide the description of related research into “Large-Scale Corpora” and “Emotion Corpora”.

2.1 Large-Scale Web-Based Corpora

The notion of a “large-scale corpus” has appeared in the linguistic and computational linguistic literature for many years. However, a study of the literature shows that what was considered “large” ten years ago does not exceed 5% (the border of statistical error) of the size of present corpora. For example, in 2001 Sasaki et al. [11] reported the construction of a question answering (QA) system based on a large-scale corpus. The corpus they used consisted of 528,000 newspaper articles. YACIS, the corpus described here, consists of 12,938,606 documents (blog pages). A rough estimate indicates that the corpus of Sasaki et al. covers less than 5% of YACIS (specifically, 528,000 / 12,938,606 ≈ 4.08%). Therefore we mostly focused on research that scales the meaning of “large” up to around a billion words and more.

Firstly, we need to address the question of whether billion-word and larger corpora are of any use to linguistics, and in what sense it is better to use a large corpus rather than a medium-sized one. This question has been answered by most of the researchers involved in the creation of large corpora, thus we will answer it briefly referring

AISB/IACAP 2012 Symposium: Linguistic And Cognitive Approaches To Dialogue Agents (LaCATODA 2012) 40