Portable automatic text classification for adverse drug reaction detection via multi-corpus training

Abeed Sarker ⇑, Graciela Gonzalez

Department of Biomedical Informatics, Arizona State University, 13212 East Shea Blvd., Scottsdale, AZ 85259, USA

Article info

Article history: Received 7 July 2014; Accepted 2 November 2014; Available online 8 November 2014

Keywords: Pharmacovigilance; Adverse drug reaction; Social media monitoring; Text classification; Natural language processing

Objective: Automatic detection of adverse drug reaction (ADR) mentions from text has recently received significant interest in pharmacovigilance research. Current research focuses on various sources of text-based information, including social media, where enormous amounts of user-posted data are available that have the potential for use in pharmacovigilance if collected and filtered accurately. The aims of this study are: (i) to explore natural language processing (NLP) approaches for generating useful features from text, and utilizing them in optimized machine learning algorithms for automatic classification of ADR-assertive text segments; (ii) to present two data sets that we prepared for the task of ADR detection from user-posted internet data; and (iii) to investigate whether combining training data from distinct corpora can improve automatic classification accuracies.

Methods: One of our three data sets contains annotated sentences from clinical reports, and the two other data sets, built in-house, consist of annotated posts from social media. Our text classification approach relies on generating a large set of features, representing semantic properties (e.g., sentiment, polarity, and topic), from short text nuggets. Importantly, using our expanded feature sets, we combine training data from different corpora in an attempt to boost classification accuracies.
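The multi-corpus idea described in the Methods can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy sentences and their labels are invented, the unigram naive Bayes classifier is a simple stand-in rather than the feature-rich classifiers used in this study, and the names `NaiveBayes`, `f_score`, and `train_and_score` are hypothetical. The sketch only shows the mechanics of training on one corpus versus training on the union of two corpora, and scoring both on the same held-out target data with the ADR-class F-score.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately crude stand-in)."""
    return text.lower().split()


class NaiveBayes:
    """Tiny multinomial naive Bayes over unigram counts (illustrative only)."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_class, best_score = None, -math.inf
        for c in self.classes:
            score = math.log(self.doc_counts[c] / total_docs)  # class prior
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    # Add-one (Laplace) smoothed word likelihood
                    score += math.log((self.word_counts[c][word] + 1) / denom)
            if score > best_score:
                best_class, best_score = c, score
        return best_class


def f_score(gold, predicted, positive="ADR"):
    """F-score (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def train_and_score(train, test):
    """Train on (text, label) pairs and return the ADR-class F-score on test."""
    texts, labels = zip(*train)
    model = NaiveBayes().fit(texts, labels)
    gold = [label for _, label in test]
    predicted = [model.predict(text) for text, _ in test]
    return f_score(gold, predicted)


# Invented toy data: a "target" social media corpus and an auxiliary corpus.
target_train = [
    ("this drug gave me terrible headaches", "ADR"),
    ("started my new prescription today", "noADR"),
    ("the pills made me so dizzy and nauseous", "ADR"),
    ("picking up my refill after work", "noADR"),
]
aux_train = [
    ("patient developed severe nausea after the drug", "ADR"),
    ("the patient was given the drug twice daily", "noADR"),
    ("rash and dizziness occurred following treatment", "ADR"),
    ("treatment was continued for two weeks", "noADR"),
]
target_test = [
    ("felt nauseous after taking the pills", "ADR"),
    ("refill reminder for my prescription", "noADR"),
    ("severe rash since i started this drug", "ADR"),
]

# Single-corpus training versus multi-corpus (combined) training.
single_f = train_and_score(target_train, target_test)
combined_f = train_and_score(target_train + aux_train, target_test)
print("single-corpus ADR F-score:", round(single_f, 3))
print("multi-corpus ADR F-score:", round(combined_f, 3))
```

With data this small the two scores carry no meaning; the point is only that the same held-out target set is used for both runs, so any difference can be attributed to the added training corpus.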
Results: Our feature-rich classification approach performs significantly better than previously published approaches, with ADR class F-scores of 0.812 (previously reported best: 0.770), 0.538 and 0.678 for the three data sets. Combining training data from multiple compatible corpora further improves the ADR F-scores for the in-house data sets to 0.597 (improvement of 5.9 units) and 0.704 (improvement of 2.6 units), respectively.

Conclusions: Our research results indicate that using advanced NLP techniques for generating information-rich features from text can significantly improve classification accuracies over existing benchmarks. Our experiments illustrate the benefits of incorporating various semantic features such as topics, concepts, sentiments, and polarities. Finally, we show that integration of information from compatible corpora can significantly improve classification performance. This form of multi-corpus training may be particularly useful in cases where data sets are heavily imbalanced (e.g., social media data), and may reduce the time and costs associated with the annotation of data in the future.

© 2014 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/).

1. Background

Early detection of adverse drug reactions (ADRs) associated with drugs in their post-approval periods is a crucial challenge for pharmacovigilance techniques. Pharmacovigilance is defined as “the science and activities relating to the detection, assessment, understanding and prevention of adverse effects or any other drug problem” [1]. Due to the various limitations of pre-approval clinical trials, it is not possible to assess the consequences of the use of a particular drug before it is released [2].
Research has shown that adverse reactions caused by drugs following their release into the market are a major public health problem, with deaths and hospitalizations numbering in the millions (up to 5% of hospital admissions, 28% of emergency visits, and 5% of hospital deaths) and associated costs of about seventy-five billion dollars annually [3–5]. Thus, post-marketing surveillance of drugs is of paramount importance for drug manufacturers, national bodies such as the U.S. Food and Drug Administration (FDA), and international organizations such as the World Health Organization (WHO) [6]. Various resources have been utilized for the monitoring of ADRs, such as voluntary reporting systems and electronic health records. The rapid growth of electronically available health related

⇑ Corresponding author. E-mail addresses: abeed.sarker@asu.edu (A. Sarker), graciela.gonzalez@asu.edu (G. Gonzalez).

Journal of Biomedical Informatics 53 (2015) 196–207. ISSN 1532-0464. http://dx.doi.org/10.1016/j.jbi.2014.11.002