Towards the Automatic Classification of Speech Subjects in the Danish Parliament Corpus

Dorte Haltrup Hansen 1, Costanza Navarretta 2 [0000-0002-4242-9249], Lene Offersgaard 3 and Jürgen Wedekind 4 [0000-0002-0759-6009]

Centre for Language Technology, Department of Nordic Studies and Linguistics, University of Copenhagen, Denmark

1 dorteh@hum.ku.dk
2 costanza@hum.ku.dk
3 leneo@hum.ku.dk
4 jwedekind@hum.ku.dk

Abstract. This paper addresses the semi-automatic subject area annotation of the Danish Parliament Corpus 2009-2017 in order to construct a gold standard corpus for automatic classification. The corpus consists of the transcriptions of the speeches in the Danish parliamentary meetings. In our annotation work, we mainly use subject categories proposed by Danish scholars in political science. The relevant subject areas were manually annotated using the titles of the agenda items for the parliamentary meetings, and the subject areas were then assigned to the corresponding speeches. Some subjects co-occur in the agendas, since they are often debated at the same time. We further analyse the fact that the same speech can belong to more than one subject area. Currently, more than 29,000 speeches have been classified using the titles of the agenda items. Different evaluation strategies have been applied. We also describe automatic classification experiments on a subset of the corpus using features extracted with NLP techniques. The best results (96% F-score) were obtained using features extracted from the agenda items. These results indicate that the gold standard corpus and agenda items can be used to automatically classify parliamentary debates with high accuracy.

Keywords: Parliamentary Debates, Subject Classification, Gold Standard Corpus.
1 Introduction

The transcriptions of parliamentary debates (Hansards) are available in many countries, and researchers from different disciplines, such as political science, linguistics and computational linguistics, have examined them in a variety of contexts. A classification of the speeches into subject areas is certainly the most basic technique for analysing their content. Moreover, it is beneficial for practical applications, such as search optimisation, and it is useful for more sophisticated analyses, e.g. of the tone in the debates