Gradient Boosted Trees for Identification of Complex Words in Context

Raksha Agarwal, Niladri Chatterjee
Indian Institute of Technology Delhi, Hauz Khas, Delhi-110016, India

Abstract
Determining whether a particular word is complex in a given context is an important task for modern NLP, as the presence of complex words may hinder smooth communication. The present work focuses on developing a binary classifier for predicting the complexity of a target word. A set of 51 features, pertaining to eight different classes, has been identified for this purpose. Four different classifiers have been used, and their performance is compared. CatBoost registered the best performance when tested on the CWI2016 dataset, and on the News and Wikinews categories of the CWI2018 dataset. In fact, the CatBoost system surpasses the top performers of the 2016 and 2018 contests in the above-mentioned cases. The optimal feature subsets for the datasets are obtained using recursive feature elimination.

Keywords
Complex Word Identification, Linguistic Features, CatBoost, Domain Adaptation

1. Introduction
The presence of difficult words in a text can lower readability and comprehension for second language learners as well as for native speakers with low literacy levels and reading difficulties [1]. This can lead to miscommunication of ideas and/or misunderstanding of contents. Automatic identification of difficult-to-understand words in a given sentence has been considered a core part of Lexical Simplification (LS) systems by several works in the past [2, 3]. This task is commonly referred to as Complex Word Identification (CWI). Omitting CWI from an LS system and adopting a 'Simplify Everything' approach [4] may obscure the meaning of the source sentence through redundant substitutions of simple words. CWI systems are categorized into four types, namely Threshold-based, Lexicon-based, Implicit CWI and Machine learning-assisted [5].
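To make the first of these categories concrete, the following is a minimal sketch of a threshold-based rule that uses relative word frequency as the simplicity metric. It is an illustration only, not the system developed in this paper: the function names, the toy corpus and the threshold value of 0.1 are all assumptions chosen for the example.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each lower-cased token to its share of the corpus (hypothetical helper)."""
    counts = Counter(t.lower() for t in tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def flag_complex_words(sentence_tokens, rel_freq, threshold):
    """Return the tokens whose relative frequency falls below the threshold.

    Words never seen in the corpus get frequency 0.0 and are therefore flagged.
    """
    return [t for t in sentence_tokens if rel_freq.get(t.lower(), 0.0) < threshold]


# Toy corpus, assumed purely for illustration (18 tokens).
toy_corpus = (
    "the cat sat on the mat . the dog sat on the cat . "
    "the obstreperous dog sat"
).split()

rel_freq = relative_frequencies(toy_corpus)
sentence = ["the", "obstreperous", "dog", "sat", "on", "the", "mat"]
flagged = flag_complex_words(sentence, rel_freq, threshold=0.1)
print(flagged)
```

A rule of this kind looks only at the word itself, so a given word receives the same label in every sentence in which it occurs.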
Threshold-based systems separate complex words from simple ones by setting a threshold value on a simplicity metric, such as word frequency [2, 6]. Lexicon-based systems make use of domain-specific lexicons for CWI to replace a complex word with a simple word/phrase of similar meaning [7]. Implicit CWI systems, instead of identifying complex words, focus on determining whether or not a word can be replaced by a simpler alternative [8]. CWI systems of the above types ignore the effect of context in determining the complexity of a word.

Proceedings of the First Workshop on Current Trends in Text Simplification (CTTS 2021), co-located with SEPLN 2021. September 21st, 2021 (Online). Saggion, H., Štajner, S. and Ferrés, D. (Eds).
raksha.agarwal@maths.iitd.ac.in (R. Agarwal); niladri@maths.iitd.ac.in (N. Chatterjee)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Machine learning-assisted CWI systems make it possible to design classifiers on an extensive feature space comprising shallow features of the target word (e.g. length of the