Chemical detection and indexing in PubMed full text articles using deep learning and rule-based methods

Tiago Almeida¹, Rui Antunes¹, João Figueira Silva¹, João Rafael Almeida¹,², and Sérgio Matos¹§

¹ DETI/IEETA, University of Aveiro, Aveiro, Portugal
² Department of Computation, University of A Coruña, Spain
§ Corresponding author. E-mail: aleixomatos@ua.pt.

Abstract—Identifying chemicals in biomedical scientific literature is a crucial task for drug development research. The BioCreative NLM-Chem challenge promoted the development of automatic systems that can identify chemicals in full-text articles and decide which chemical concepts are relevant to be indexed. This work describes the participation of the BIT.UA team from the University of Aveiro, where we propose a three-stage automatic pipeline that individually tackles (i) chemical mention detection, (ii) entity normalization and (iii) indexing. We adopted a deep learning solution based on a biomedical BERT variant for chemical identification. For normalization we used a rule-based approach and a hybrid version that explores a dense retrieval mechanism. Similarly, for indexing we followed two distinct approaches: a rule-based method and a TF-IDF-based method. Our best official results are consistently above the official median and benchmark in the three subtasks, with F1-scores of 0.8454, 0.8136, and 0.4664, respectively.

Keywords—chemical identification; named entity recognition; normalization; chemical indexing; deep learning; transformer-based model.

I. INTRODUCTION

Automatic information extraction from biomedical scientific literature is an essential step in supporting curation tasks, although it remains a challenging problem that is far from solved (1). In particular, the identification of chemical names advances drug development research. This task, known as named entity recognition (NER), is usually followed by a normalization step where entity mentions are linked to unique codes from a standard vocabulary.
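To make the two steps concrete, the following minimal Python sketch represents a mention as a character span and links it to a vocabulary code by exact, case-insensitive lookup. It is only an illustration of the NER and normalization outputs, not the system described in this paper: the `Mention` class, the function names, and the two-entry `TOY_VOCAB` are assumptions introduced here, and the string-matching "detector" merely stands in for a trained NER model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Mention:
    """A chemical mention as a character span, optionally linked to a code."""
    start: int
    end: int
    text: str
    mesh_id: Optional[str] = None


# Toy vocabulary (assumption, for illustration only): lowercase name -> MeSH code.
TOY_VOCAB: Dict[str, str] = {
    "aspirin": "MESH:D001241",
    "ethanol": "MESH:D000431",
}


def detect_mentions(text: str, vocab: Dict[str, str]) -> List[Mention]:
    """Stand-in for NER: find every occurrence of a vocabulary name in the text."""
    lowered = text.lower()  # length-preserving for the ASCII names used here
    found: List[Mention] = []
    for name in vocab:
        pos = lowered.find(name)
        while pos != -1:
            end = pos + len(name)
            found.append(Mention(pos, end, text[pos:end]))
            pos = lowered.find(name, pos + 1)
    return sorted(found, key=lambda m: m.start)


def normalize(mentions: List[Mention], vocab: Dict[str, str]) -> List[Mention]:
    """Link each mention to a code; mentions not in the vocabulary stay unlinked."""
    return [
        Mention(m.start, m.end, m.text, vocab.get(m.text.lower()))
        for m in mentions
    ]


if __name__ == "__main__":
    sentence = "Aspirin was dissolved in ethanol."
    for mention in normalize(detect_mentions(sentence, TOY_VOCAB), TOY_VOCAB):
        print(mention)
```

Keeping detection and linking as separate functions mirrors the split between the two steps, so either one could be replaced (for example, by a learned tagger or a fuzzier matcher) without changing the other.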
Biomedical information extraction systems have predominantly been assessed on PubMed abstracts only. Although PubMed full-text articles carry valuable additional information, they pose new challenges stemming from their more detailed explanations and statements and their more complex writing style when compared to abstracts.

PubMed provides biomedical researchers, biologists, pharmacologists, epidemiologists, physicians (and others) with a way to search for the most relevant research articles. Offering accurate search results expedites their work, and to improve the quality of PubMed search results it is imperative that related information is added to every article. MeSH (Medical Subject Headings) identifiers are used to index articles in PubMed; however, the appropriate MeSH identifiers for each article are added manually, in a process that costs time and requires expertise. The BioCreative VII Track 2 (NLM-Chem) challenge (2) aims to engage the text mining community in tackling this issue. Participating teams are encouraged to develop computerized solutions and share their systems, since automatic annotations may help expert curators with their manual work.

In this paper we describe the methods used in our participation in BioCreative VII Track 2 (NLM-Chem). This track comprises two tasks: (i) chemical identification and (ii) chemical indexing. In the first task, the goal is to recognize chemical mentions (named entity recognition) and link predicted entities to their respective MeSH identifiers (normalization). The second task aims to predict the chemical MeSH identifiers that should be used to index each document (that is, to find the most relevant MeSH terms for each document).

II. DATA

Task organizers provided two main datasets (3): training and evaluation, both consisting of PubMed full-text articles.
The training dataset corresponds to the NLM-Chem corpus (4) containing 150 documents, whereas the evaluation dataset comprises 1387 documents that were scheduled for human indexing in 2021. During the challenge we only had access to the ground truth annotations of the training dataset to develop our system. Regarding the evaluation dataset, only a subset of those articles was manually annotated for the evaluation of the chemical identification task, whilst all articles were used for the evaluation of the chemical indexing task (all documents were manually indexed by human curators).

To foster the implementation of enhanced systems, the organizers also shared two other compatible datasets that could help improve the participants' systems: CHEMDNER (5) and CDR (6). These datasets contain 10000 and 1500 documents respectively, but these documents are PubMed abstracts and not full-text articles as in the NLM-Chem dataset. Both datasets contain the chemical mention annotations and the chemical MeSH indexing identifiers, but only the CDR dataset contains the MeSH identifiers for each chemical mention (normalization).

T.A., R.A., J.F.S. and J.R.A. are funded by the FCT - Foundation for Science and Technology (Portuguese national funds) - under the grants 2020.05784.BD, SFRH/BD/137000/2018, PD/BD/142878/2018 and SFRH/BD/147837/2019, respectively.

We also used other datasets to support the NER component. We used the DrugProt training and development subsets provided in BioCreative VII Track 1 (DrugProt), since these contain manually annotated chemical mentions. Documents from DrugProt that also appear in CDR and CHEMDNER (repeated