Augmenting a De-identification System for Swedish Clinical Text Using Open Resources and Deep Learning

Hanna Berg
Department of Computer and Systems Sciences
Stockholm University
hanna.berg@dsv.su.se

Hercules Dalianis
Department of Computer and Systems Sciences
Stockholm University
hercules@dsv.su.se

Abstract

Electronic patient records are produced in abundance every day and there is a demand to use them for research or management purposes. The records, however, contain information in the free text that can identify the patient, and therefore tools are needed to identify this sensitive information.

The aim is to compare two machine learning algorithms, Long Short-Term Memory (LSTM) and Conditional Random Fields (CRF), applied to a Swedish clinical data set annotated for de-identification. The results show that CRF performs better than deep learning with LSTM, with CRF giving the best results with an F1 score of 0.91 when adding more data from within the same domain. Adding general open data did, on the other hand, not improve the results.

1 Introduction

Electronic health records (EHR) are today produced in abundance and contain information that is valuable for improving the medical care of future patients. They are, however, seldom reused for research, as the free text in patient records often contains information that could identify patients. To enable access to electronic health records while preserving patient privacy, there is a need for automatic de-identification.

The US Health Insurance Portability and Accountability Act (HIPAA) defines 18 categories of Protected Health Information (PHI) which have to be concealed for EHRs to be considered de-identified in the US (Health Insurance Portability and Accountability Act (HIPAA), 2003). The categories include names, geographic divisions smaller than a state, dates related to an individual, contact information and other data that can uniquely identify the individual.
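To make the de-identification task concrete, the following is a minimal sketch of concealing PHI by replacing matched spans with category placeholders. It is not the system described in this paper: the two regular expressions, the category labels and the function name `mask_phi` are hypothetical and chosen only for illustration, and they cover just a fraction of two PHI categories (dates and phone numbers).

```python
import re

# Hypothetical patterns for two PHI categories (assumptions for illustration).
# A real de-identification system needs far broader coverage than this.
PATTERNS = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),    # e.g. 2019-03-14
    "PHONE": re.compile(r"\b0\d{1,3}-\d{5,8}\b"),    # e.g. 08-1234567
}


def mask_phi(text):
    """Replace every span matched by a pattern with a <CATEGORY> placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


print(mask_phi("Admitted 2019-03-14, call 08-1234567."))
# prints: Admitted <DATE>, call <PHONE>.
```

Categories such as names or locations cannot be captured reliably with fixed patterns like these, which is one reason annotated data and learned models are of interest for this task.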
Modules built to identify PHI primarily rely on two methods: rule-based methods and supervised machine learning methods (Meystre et al., 2010). The two methods are often used together in hybrid systems (Stubbs et al., 2017). Rule-based methods do not require annotated data for training, are easy to modify and give results that are easy to interpret, but they lack robustness, and designing rules is a complex task (Meystre et al., 2010). Machine learning methods may provide greater robustness, but require large amounts of annotated data. According to Dernoncourt et al. (2017), statistical machine learning models require feature engineering, while artificial neural networks (ANN) do not. The latter do, however, require more data.

Lee et al. (2017) show that training a model on a large source data set and then fine-tuning it by retraining on the smaller target data set can improve the results compared with using only the smaller data set. While the data sets used by Lee et al. (2017) consisted of 29,000 PHI instances in the smaller target data set and 61,000 PHI instances in the larger source data set, the largest available Swedish data set, the Stockholm EPR PHI Corpus, has only 4,421 instances of PHI (Velupillai et al., 2009; Dalianis and Velupillai, 2010). There is also a smaller related corpus of electronic health records annotated for de-identification, the Stockholm EPR PHI Domain Corpus (Henriksson et al., 2017b). A larger data set of general Swedish text annotated for named entity recognition also exists, the Stockholm Umeå Corpus (Östling, 2012).

This study investigates the possibilities of augmenting the quality of de-identification by adding a general Swedish data set for named entity recognition such as Stockholm Umeå Corpus to already existing annotated PHI data sets and secondly the