Combining a rule-based approach and machine learning in a good-example extraction task for the purpose of lexicographic work on contemporary standard German

Lothar Lemnitzer¹, Christian Pölitz², Jörg Didakowski¹, Alexander Geyken¹
¹ Berlin-Brandenburgische Akademie der Wissenschaften, 10117 Berlin, Jägerstr. 22
² Technische Universität Dortmund, Fakultät für Informatik, Otto-Hahn-Str. 14, 44227 Dortmund
E-mail: {lemnitzer,didakowski,geyken}@bbaw.de, poelitz@uni-dortmund.de

Abstract

The work presented in this paper is part of a dictionary project at the Berlin-Brandenburg Academy of Sciences and Humanities. For a large number of headwords, example sentences for their respective lexicographic descriptions have to be retrieved from a corpus of contemporary German. Lexicographers are typically faced with a huge number of corpus citations. A tool that selects only good examples (those that are considered for inclusion in the dictionary) and dismisses the others would therefore save time and effort. A rule-based good-example extractor proved to be a good starting point, but the tool still delivers too many unacceptable citations. We have therefore tried to combine this tool with a machine learner that is trained on the decisions of an experienced lexicographer. The learner has been optimized to reject a large share of the example sentences. We present the machine learning results on a test data set with various combinations of linguistic features and quantify the gain in time and effort for the lexicographers. We also discuss the shortcomings of our approach and suggest some measures to counter them.

Keywords: example extraction; machine learning; corpus linguistics; German

1. Introduction and motivation

The work reported in this paper originates from a large dictionary project at the Berlin-Brandenburg Academy of Sciences and Humanities (BBAW).
The task is to update a legacy dictionary of contemporary German (Klein & Geyken, 2010). Approximately 45,000 lexical units that have become part of the German vocabulary during the last 40 years have to be registered and handled lexicographically (cf. Geyken & Lemnitzer, 2012). One of the principles of the work is to illustrate the lexicographical description, in particular concerning the meanings and usages of lexical items, with citations from a large German corpus. The underlying corpus has been built and continually extended at the BBAW (cf. Geyken, 2007). A large share of it can be consulted and queried through a search engine on the website of the project (www.dwds.de). The corpus currently contains