There And Back Again – from Dictionary to Wordnet to Thesaurus and Vice Versa: How to Use and Reuse Dictionary Data in a Conceptual Dictionary

Henrik Lorentzen, Lars Trap-Jensen
Society for Danish Language and Literature
Christians Brygge 1, 1219 Copenhagen K, Denmark
E-mail: hl@dsl.dk, ltj@dsl.dk

Abstract

This is a story of elexicographical evolution and how lexical data are used and reused to develop new products and new presentations. At eLex 2009 we demonstrated how data from The Danish Dictionary were used to construct a wordnet for Danish, DanNet, and we showed how, in a beta version, DanNet data could be used to improve the onomasiological component of the dictionary. Since then, we have used both DanNet and dictionary data in an ongoing project to create a conceptual dictionary – a thesaurus – for Danish. In this article we will show how the thesaurus data help us overcome the major problems connected with the direct use of wordnet data in a dictionary for human users. Focus is on the editing principles of the thesaurus and how data are presented in The Danish Dictionary online.

Keywords: thesaurus; wordnet; onomasiological search; e-dictionary

1. Background

A dictionary stock of 91,500 entries with 116,000 meaning descriptions and a wordnet consisting of 65,000 synsets connected through 75,000 internal semantic relations provide the starting point for a thesaurus project which is currently being developed at the Society for Danish Language and Literature in Copenhagen. Historically, data from The Danish Dictionary (Den Danske Ordbog, DDO) were used to construct DanNet, and since both serve as input for the thesaurus, all three resources are closely interconnected. The Danish thesaurus project set off in 2010 and will appear in 2013 as a publication of its own – and even as a printed dictionary. In this connection, focus is on the thesaurus data and how they can be used to improve the onomasiological component of the online version of the DDO.
In Louvain 2009, we gave a presentation of the dictionary site ordnet.dk, of which the DDO is but one element (the others being a comprehensive historical dictionary and a corpus component), and we showed how data from the Danish wordnet were exploited in a new element, Related words, that was introduced in the online version of the dictionary (Trap-Jensen, 2010). We demonstrated how candidates for Related words could be automatically extracted from DanNet but it was emphasized – as it still is in the version available to the public – that it is a beta version and not without its problems. In this article, we dwell on the nature of the problems involved and how they can be solved by using data from the thesaurus instead.

2. Shortcomings of wordnet data

In Trap-Jensen (2010), some problematic areas were mentioned and possible solutions suggested. Let us briefly recapitulate the central issues as well as other shortcomings that we have encountered since.

First, there is the problem of overgeneration. This pertains to both co-hyponyms and hyponyms and is due to the fact that the categories are often too broad and the hyponymy hierarchy too shallow. From time to time, Princeton WordNet has been criticised for having too deep and detailed a hierarchical semantic structure and it is therefore important to stress that DanNet is not a translation of Princeton WordNet but was built on original Danish data primarily extracted from the DDO, mainly to ensure that the conceptual world reflected by the Danish language is maintained in the resource.
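
To make the overgeneration problem concrete, the following minimal sketch shows how co-hyponyms can be read off a hyponymy hierarchy and why a shallow hierarchy yields larger groups. The word lists, the two toy hierarchies and the function name co_hyponyms are assumptions made for illustration only; they are not taken from DanNet and do not reproduce the extraction procedure actually used for Related words.

```python
from collections import defaultdict

# Hypothetical toy data: each word maps to its single direct hypernym.
# These entries are illustrative and are NOT taken from DanNet.
SHALLOW = {
    "oak": "plant",
    "beech": "plant",
    "rose": "plant",
    "tulip": "plant",
    "moss": "plant",
}

# The same vocabulary with one intermediate level added.
DEEP = {
    "oak": "tree",
    "beech": "tree",
    "rose": "flower",
    "tulip": "flower",
    "moss": "plant",
    "tree": "plant",
    "flower": "plant",
}


def co_hyponyms(word, hypernym_of):
    """Return the words sharing the direct hypernym of `word`, sorted."""
    # Invert the child -> parent mapping into parent -> set of children.
    children = defaultdict(set)
    for child, parent in hypernym_of.items():
        children[parent].add(child)
    parent = hypernym_of.get(word)
    if parent is None:
        return []  # unknown word or top node: no co-hyponyms
    return sorted(children[parent] - {word})


shallow_result = co_hyponyms("oak", SHALLOW)
deep_result = co_hyponyms("oak", DEEP)
print(shallow_result)  # every other plant is a candidate
print(deep_result)     # only the other tree remains
```

In the shallow hierarchy every other plant is returned as a related word for "oak", whereas the intermediate node "tree" in the deeper hierarchy narrows the candidates to a single word. With real data the same effect is multiplied across hundreds or thousands of entries.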
The combination of a relatively shallow semantic hierarchy and a limited number of semantic classes – DanNet operates with approximately 200 ontological types as opposed to the 900-1,000 semantic groups found in thesauri like Roget (2002) for English or Dornseiff (2004) for German – implies a built-in risk of too broad semantic categories resulting in too many and not always the most evident related words when the candidates are automatically extracted from DanNet. Examples in DanNet are the large groups of vocabulary items for persons, plants and verbal nouns with hundreds or, in extreme cases, even thousands of co-hyponyms. To some extent this can be remedied by using combinations of other relations to narrow down the number of members in each category. Our solution to the problem, as described in Trap-Jensen (2010), has been to develop an algorithmic method that ranks words from large groups, primarily based on the number and nature of shared relations. Even if the algorithm has improved usability, the overall conclusion is nevertheless that a good deal of manual effort would still be needed for this element to work properly for the human user.

A second major problem with the DanNet data relates to the fact that synsets are seldom found in more than a single place within the semantic network. For example, the assignment of multiple hyperonyms to synsets,

Proceedings of eLex 2011, pp. 175-179