Mining Adverse Drug Reactions from
Unstructured Mediums at Scale
Hasham Ul Haq, Veysel Kocaman, David Talby
John Snow Labs Inc.
16192 Coastal Highway
Lewes, DE, USA 19958
{hasham, veysel, david}@johnsnowlabs.com
Abstract
Adverse drug reactions / events (ADR/ADE) have a major
impact on patient health and health care costs. Detecting
ADR’s as early as possible and sharing them with regula-
tors, pharma companies, and healthcare providers can pre-
vent morbidity and save many lives. While most ADR’s are
not reported via formal channels, they are often documented
in a variety of unstructured conversations such as social
media posts by patients, customer support call transcripts,
or CRM notes of meetings between healthcare providers
and pharma sales reps. In this paper, we propose a natural
language processing (NLP) solution that detects ADR’s in
such unstructured free-text conversations, which improves
on previous work in three ways. First, a new Named En-
tity Recognition (NER) model obtains new state-of-the-
art accuracy for ADR and Drug entity extraction on the
ADE, CADEC, and SMM4H benchmark datasets (91.75%,
78.76%, and 83.41% F1 scores respectively). Second, two
new Relation Extraction (RE) models, one based on BioBERT
and the other using crafted features over a Fully Connected
Neural Network (FCNN), are shown to perform on par with
existing state-of-the-art models, and to outperform them
when trained with a supplementary clinician-annotated RE
dataset. Third, a new text classification model, for
deciding if a conversation includes an ADR,
obtains new state-of-the-art accuracy on the CADEC dataset
(86.69% F1 score). The complete solution is implemented
as a unified NLP pipeline in a production-grade library built
on top of Apache Spark, making it natively scalable and able
to process millions of batch or streaming records on com-
modity clusters.
Introduction
Adverse drug events are harmful side effects of drugs,
comprising allergic reactions, overdose response, and general
unpleasant side effects. Approximately 2 million patients in
the United States are affected each year by serious ADR’s,
resulting in roughly 100,000 fatalities (Leaman et al. 2010),
and making ADR’s the fourth leading cause of death in the
United States (Giacomini et al. 2007). Treatment related to
ADR’s has been estimated to cost $136 billion each year in
the United States alone (van der Hooft et al. 2006).
Copyright © 2022, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
Finding all ADR’s of a drug before it is marketed is not
practical for several reasons. First, the number of human
subjects going through clinical trials is often too small to
detect rare ADR’s. Second, many clinical trials are short-
lasting while some ADR’s take time to manifest. Third,
some ADR’s only show when a drug is taken together with
other drugs, and not all drug-drug combinations can be
tested during clinical trials. Fourth, drug repurposing or off-
label usage can lead to unforeseen ADR’s. As a result, de-
tecting ADR’s in drugs which are already being marketed is
critical; this discipline is known as postmarketing
pharmacovigilance (Mammì et al. 2013).
Schemes which allow hospitals, clinicians, and patients to
report ADR’s have existed for many years, but only a frac-
tion of events get reported through them. A meta-analysis of
37 studies from 12 countries found that the median rate of
under-reporting was 94% (Hazell and Shakir 2006). This led
to work on mining ADR’s from alternative sources, such as
social media posts by patients or healthcare providers
(Bollegala et al. 2018). The outbreak of the COVID-19 pandemic
has accelerated the trend of sharing such information (Cinelli
et al. 2020). The size, variety, and instantaneous nature of
social media provide opportunities for real-time monitoring
of ADR’s (Sloane et al. 2015). Compared to traditional data
sources like research publications, this data is more
challenging to process, as it is unstructured and contains noise in
the form of jargon, abbreviations, misspellings, and complex
sentence structures.
Recent advancements in Natural Language Processing
(NLP), in the form of Transformer-based architectures
(Vaswani et al. 2017) such as BERT (Devlin et al. 2018), have
significantly pushed the boundaries of NLP capabilities. There
is an increasing trend of training large models, such as
BioBERT (Lee et al. 2019), on domain-specific data, and these
methods have been shown to achieve state-of-the-art (SOTA)
results for document understanding and named entity
recognition (NER). However, since these methods require
significant computational resources during both training and
inference, it becomes impractical to apply them over large
quantities of records in compute-restricted production
environments.
Despite the growing interest and opportunities to process
arXiv:2201.01405v2 [cs.CL] 6 Jan 2022