This is the final authors’ version of: Hardie, A and Ibrahim, W (in press) Exploring and classifying the Arabic copula and auxiliary kāna via enhanced part-of-speech tagging. Corpora. In case of any difference between this version and the journal’s typeset version, the latter is to be considered definitive.

Exploring and categorising the Arabic copula and auxiliary kāna via enhanced part-of-speech tagging

Andrew Hardie * and Wesam Ibrahim †

* Linguistics and English Language, Lancaster University, UK
† Department of Basic Sciences, Community College, Princess Nourah bint Abdulrahman University, Saudi Arabia; and Department of Foreign Languages, Faculty of Education, Tanta University, Egypt

a.hardie@lancaster.ac.uk; wmibrahim@pnu.edu.sa

Abstract

Arabic syntax has yet to be studied in detail from a corpus-based perspective. The Arabic copula kāna, ‘be’, functions additionally as an auxiliary, creating periphrastic tense-aspect constructions; but the literature on these functions is far from exhaustive. To analyse kāna within the million-word Leeds Corpus of Contemporary Arabic, part-of-speech tagging (using novel, targeted enhancements to a previously described program which improves the accessibility for linguistic analysis of the output of Habash et al.’s 2012 MADA disambiguator for the Buckwalter Arabic morphological analyser) is applied to disambiguate copula and auxiliary at a high rate of accuracy. Concordances of both are extracted, and 10% samples (499 instances of copula kāna, 387 of auxiliary kāna) are manually analysed to identify surface-level grammatical patterns and meanings. This raw analysis is then systematised according to the more general patterns’ main parameters of variation; special descriptions are developed for specific, apparently fixed-form expressions (including two phraseologies which afford expression of verbal and adjectival modality). Overall, substantial new detail, not mentioned in existing grammars, is discovered (e.g.
the quantitative predominance of the past imperfect construction over other uses of auxiliary kāna); there exists notable potential for these corpus-based findings to inform and enhance not only grammatical descriptions, but also pedagogy of Arabic as a first or second/foreign language.

1. Introduction [1]

[1] Hardie’s work on this paper was supported by the ESRC Centre for Corpus Approaches to Social Science (CASS) (grant reference ES/R008906/1). Ibrahim’s work on this paper was supported by the Deanship of Scientific Research at Princess Nourah bint Abdulrahman University, through the Fast-track Research Funding Program.

The Arabic grammatical tradition is long-established and sophisticated (Owens, 1990, 1997). Yet in comparison to contemporary linguistic approaches to description of grammar, this tradition offers less attention to matters of syntax as opposed to morphology. Given the complexity of derivation and inflection in Arabic, this is no surprise; a similar focus on morphology over syntax is observable in other classical grammatical traditions, such as the Sanskrit (e.g. the Aṣṭādhyāyī of Pāṇini: Cardona, 1976) or the Greek (e.g. the Tekhnē Grammatikē of Dionysius Thrax: Forbes, 1933:112). An example is the tense-aspect-mood system. Famously, in Classical and Modern Standard Arabic, verbs exhibit two main finite forms, described as perfect/imperfect aspect or