Mining Adverse Drug Reactions from Unstructured Mediums at Scale

Hasham Ul Haq, Veysel Kocaman, David Talby
John Snow Labs Inc.
16192 Coastal Highway
Lewes, DE, USA 19958
{hasham, veysel, david}@johnsnowlabs.com

Abstract

Adverse drug reactions / events (ADR/ADE) have a major impact on patient health and health care costs. Detecting ADRs as early as possible and sharing them with regulators, pharma companies, and healthcare providers can prevent morbidity and save many lives. While most ADRs are not reported via formal channels, they are often documented in a variety of unstructured conversations such as social media posts by patients, customer support call transcripts, or CRM notes of meetings between healthcare providers and pharma sales reps. In this paper, we propose a natural language processing (NLP) solution that detects ADRs in such unstructured free-text conversations, which improves on previous work in three ways. First, a new Named Entity Recognition (NER) model obtains new state-of-the-art accuracy for ADR and Drug entity extraction on the ADE, CADEC, and SMM4H benchmark datasets (91.75%, 78.76%, and 83.41% F1 scores respectively). Second, two new Relation Extraction (RE) models, one based on BioBERT and the other using crafted features over a Fully Connected Neural Network (FCNN), are shown to perform on par with existing state-of-the-art models, and to outperform them when trained with a supplementary clinician-annotated RE dataset. Third, a new text classification model, for deciding if a conversation includes an ADR, obtains new state-of-the-art accuracy on the CADEC dataset (86.69% F1 score). The complete solution is implemented as a unified NLP pipeline in a production-grade library built on top of Apache Spark, making it natively scalable and able to process millions of batch or streaming records on commodity clusters.
Introduction

Adverse drug events are harmful side effects of drugs, comprising allergic reactions, overdose response, and general unpleasant side effects. Approximately 2 million patients in the United States are affected each year by serious ADRs, resulting in roughly 100,000 fatalities (Leaman et al. 2010), and making ADRs the fourth leading cause of death in the United States (Giacomini et al. 2007). Treatment related to ADRs has been estimated to cost $136 billion each year in the United States alone (van Der Hooft et al. 2006).

Copyright © 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Finding all ADRs of a drug before it is marketed is not practical for several reasons. First, the number of human subjects going through clinical trials is often too small to detect rare ADRs. Second, many clinical trials are short-lasting while some ADRs take time to manifest. Third, some ADRs only show when a drug is taken together with other drugs, and not all drug-drug combinations can be tested during clinical trials. Fourth, drug repurposing or off-label usage can lead to unforeseen ADRs. As a result, detecting ADRs in drugs which are already being marketed is critical; this discipline is known as postmarketing pharmacovigilance (Mammì et al. 2013).

Schemes which allow hospitals, clinicians, and patients to report ADRs have existed for many years, but only a fraction of events get reported through them. A meta-analysis of 37 studies from 12 countries found that the median rate of under-reporting was 94% (Hazell and Shakir 2006). This led to work on mining ADRs from alternative sources, such as social media posts by patients or healthcare providers (Bollegala et al. 2018). The outbreak of the COVID-19 pandemic has precipitated this trend of sharing such information (Cinelli et al.
2020). The size, variety, and instantaneous nature of social media provide opportunities for real-time monitoring of ADRs (Sloane et al. 2015). Compared to traditional data sources such as research publications, this data is more challenging to process, as it is unstructured and contains noise in the form of jargon, abbreviations, misspellings, and complex sentence structures.

Recent advancements in Natural Language Processing (NLP), in the form of Transformer-based (Vaswani et al. 2017) architectures such as BERT (Devlin et al. 2018), have significantly pushed the boundaries of NLP capabilities. There is an increasing trend of training large models on domain-specific data, such as BioBERT (Lee et al. 2019), and these methods have proven to achieve state-of-the-art (SOTA) results for document understanding and named entity recognition (NER). However, since these methods require significant computational resources during both training and inference, it becomes impractical to apply them over large quantities of records in compute-restricted production environments.

arXiv:2201.01405v2 [cs.CL] 6 Jan 2022

Despite the growing interest and opportunities to process