Extending the Japanese WordNet

Francis Bond, † Hitoshi Isahara, ‡ Kiyotaka Uchimoto, † † Takayuki Kuribayashi † and Kyoko Kanzaki ‡

† NICT Language Infrastructure Group, ‡ NICT Language Translation Group
MASTAR Project
National Institute of Information and Communications Technology
<{bond,isahara,uchimoto,kuribayashi,kanzaki}@nict.go.jp>

1 Introduction

Our goal is to make a semantic lexicon of Japanese that is both accessible and usable. To this end we are constructing and releasing the Japanese WordNet (WN-Ja) (Bond et al., 2008b,a). We have almost completed the first stage, in which we automatically translated the English and Euro WordNets and are hand-correcting the result. We introduce this in Section 2.

Currently, we are extending it in three main areas. The first is to add more concepts to the Japanese WordNet, either by adding Japanese to existing English synsets or by creating new synsets (§3). The second is to link the synsets to text examples (§4). Finally, we are linking it to other resources: the semantic lexicon GoiTaikei (Ikehara et al., 1997) and a collection of illustrations taken from the Open ClipArt Library (Phillips, 2005) (§5).

2 Current State

Currently, the WN-Ja consists of 157,646 senses (word-synset pairs), 36,922 concepts (synsets) and 73,113 unique Japanese words. The relational structure (hypernym, meronym, domain, ...) is based entirely on the English WordNet 3.0 (Fellbaum, 1998). Of these entries, 81% have been checked by hand, 11% were automatically created by linking through multiple languages and 8% were automatically created by adding non-ambiguous translations, as described in Bond et al. (2008a). For up-to-date information on WN-Ja see: nlpwww.nict.go.jp/wn-ja.

An example of the entry for the synset 02076196-n is shown in Figure 1. Most fields come from the English WordNet. We have added the underlined fields (Ja Synonyms, Illustration, GoiTaikei) and are currently adding the translated gloss.
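The three counts reported in this section measure different things: a sense is one word-synset pair, so a word that belongs to two synsets contributes two senses but only one unique word. The following is a minimal Python sketch of that bookkeeping, not the WN-Ja tooling. The sense list is a toy inventory, and the identifier toy-0001-n and the function names are made up for illustration; only 02076196-n comes from the text.

```python
from collections import defaultdict

# Toy sense inventory; each sense is one (word, synset) pair.
# "02076196-n" is the synset discussed in the text. "toy-0001-n" is a
# made-up identifier, used only to show one word carrying two senses.
SENSES = [
    ("アザラシ", "02076196-n"),
    ("海豹", "02076196-n"),
    ("シール", "02076196-n"),
    ("シール", "toy-0001-n"),
]


def summarize(senses):
    """Count distinct senses, synsets and words in a sense list."""
    distinct = set(senses)
    return {
        "senses": len(distinct),
        "synsets": len({synset for _, synset in distinct}),
        "words": len({word for word, _ in distinct}),
    }


def words_by_synset(senses):
    """Group words under their synset, keeping first-seen order."""
    grouped = defaultdict(list)
    for word, synset in senses:
        if word not in grouped[synset]:
            grouped[synset].append(word)
    return dict(grouped)


print(summarize(SENSES))
print(words_by_synset(SENSES)["02076196-n"])
```

In this toy inventory there are four senses but only three unique words, which is the same relationship that makes the number of senses in WN-Ja larger than the number of unique words.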
In the initial automatic construction there were 27 Japanese words associated with the synset,¹ including many inappropriate translations for other senses of seal (e.g., 判こ hanko “stamp”). These were reduced to three after checking: アザラシ, 海豹 azarashi “seal” and シール shi-ru “seal”.

The main focus of this year’s work has been this trimming of badly translated words. The result is a WordNet with a reasonable coverage of common Japanese words. The precision per sense is estimated to be just over 90%. We have aimed at high coverage at the cost of precision for two reasons: (i) we think that the WordNet must have a reasonable coverage to be useful for NLP tasks and (ii) we expect to continue refining the accuracy over the following years.

3 Increasing Coverage

We are increasing the coverage in two ways. The first is to continue to manually correct the automatically translated synsets: there are still some 27,000 unchecked synsets. More interestingly, we wish to add synsets for Japanese concepts that may not be expressed in the English WordNet.

To decide which new concepts to add, we will be guided by the other tasks we are doing: annotation and linking. We intend to create new synsets for words found in the corpora we annotate that are not currently covered, as well as for concepts that we want to link to. An example of the first is the concept 御飯 gohan “cooked rice”, as opposed to the grain 米 kome “rice”. An example of the second is シングル shinguru “single: a song usually extracted from a current or upcoming album to promote the album”. This is a very common hypernym in Wikipedia but missing from the English WordNet.

¹ アザラシ, シール, スタンプ, 上封, 判, 判こ, 判子, 刻印, 加判, 印, 印判, 印形, 印章, 印鎰, 印鑑, 印鑰, 印顆, 墨引, 墨引き, 封, 封じ目, 封印, 封目, 封着, 封緘, 押し手, 押印, 押手, 押捺, 捺印, 極印, 海豹, 版行, 符節, 緘, 証印, 調印
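The corpus-driven part of Section 3, finding words in annotated text that have no synset yet, can be sketched in a few lines of Python. This is an illustrative sketch only, not the procedure used for WN-Ja: it assumes the corpus has already been segmented into words (Japanese text would need a morphological analyzer first), and the function name, the min_count parameter and the toy data are all hypothetical.

```python
from collections import Counter


def uncovered_words(corpus_tokens, lexicon_words, min_count=1):
    """Return (word, frequency) pairs for corpus words absent from the lexicon.

    Results are ordered by descending frequency, so the most common
    gaps can be reviewed first as candidates for new synsets.
    """
    lexicon = set(lexicon_words)
    counts = Counter(tok for tok in corpus_tokens if tok not in lexicon)
    return [(word, n) for word, n in counts.most_common() if n >= min_count]


# Toy data: a pre-segmented corpus and a lexicon that covers only 米.
tokens = ["御飯", "米", "御飯", "シングル"]
lexicon = {"米"}

print(uncovered_words(tokens, lexicon))
print(uncovered_words(tokens, lexicon, min_count=2))
```

Ranking by frequency is one possible design choice; a list of link targets, such as common hypernyms taken from another resource, could be checked against the lexicon in the same way.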