Automatic de-identification of protected health information

Jelena Jaćimović*†, Cvetana Krstev*, Drago Jelovac†

* University of Belgrade, Faculty of Philology
Studentski trg 3, 11000 Belgrade, Serbia
jjacimovic@rcub.bg.ac.rs
cvetana@matf.bg.ac.rs

† University of Belgrade, School of Dental Medicine
Dr. Subotića 8, 11000 Belgrade, Serbia
drago.jelovac@stomf.bg.ac.rs

Abstract

This paper presents an automatic de-identification system for Serbian, built on a rapid adaptation of the existing named entity recognition system. Based on a finite-state methodology and lexical resources, the system is designed to detect and replace all explicit personal protected health information present in medical narrative texts, while still preserving all the relevant medical concepts. The results of a preliminary evaluation demonstrate the usefulness of this method both in preserving patient privacy and in maintaining the interoperability of de-identified documents.

Avtomatska dezidentifikacija zaščitenih zdravstvenih podatkov

V prispevku predstavimo sistem za avtomatsko dezidentifikacijo v srbščini, ki temelji na hitri prilagoditvi obstoječega sistema za identifikacijo imenskih entitet. Sistem je zasnovan na metodologiji končnih avtomatov in jezikovnih virov ter identificira in zamenja vse eksplicitne zaščitene zdravstvene osebne podatke v medicinskih narativnih besedilih, pri čemer pa ohrani relevantne medicinske koncepte. Rezultati preliminarne evalvacije so pokazali uporabnost te metode, in sicer tako pri zaščiti osebnih podatkov pacientov kot pri interoperabilnosti dezidentificiranih dokumentov.

1. Introduction

Current advances in health information technology enable health care providers and organizations to automate most aspects of patient care management, facilitating the collection, storage and use of patient information.
Such information, stored in the form of electronic medical records (EMRs), provides accurate and comprehensive clinical data that is a vital resource for secondary uses such as quality improvement, research, and teaching. Alongside this wealth of useful information, the narrative clinical texts of the EMR also include many items of patient-identifying information. For both ethical and legal reasons, when confidential clinical data are shared and used for research purposes, it is necessary to protect patient privacy and remove patient-specific identifiers through a process of de-identification.

De-identification focuses on detecting and removing or modifying all explicit personal Protected Health Information (PHI) present in medical or other records, while still preserving all the medically relevant information about the patient. Various standards and regulations for health data protection define multiple ways to achieve de-identification, but the most frequently referenced regulation is the US Health Insurance Portability and Accountability Act (HIPAA) (HIPAA, 1996). According to the HIPAA “Safe Harbor” approach, clinical records are considered de-identified when 18 categories of PHI are removed and the remaining information cannot be used, alone or in combination with other information, to identify an individual. These PHI categories include names, geographic locations, elements of dates (except year), telephone and fax numbers, medical record numbers and any other unique identifying numbers, among others. Since manual removal of PHI by medical professionals has proved to be prohibitively time-consuming, tedious, costly and unreliable (Douglass et al., 2004; Neamatullah et al., 2008; Deleger et al., 2013), extracting PHI requires more reliable, faster and cheaper automatic de-identification systems based on Natural Language Processing (NLP) methods (Meystre et al., 2010).
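To make the Safe Harbor categories more concrete, the following minimal sketch masks two of them, dates (keeping only the year) and telephone numbers, with regular expressions. It is only an illustration under stated assumptions: the patterns `DATE` and `PHONE`, the function `deidentify`, the placeholder tags and the sample sentence are all hypothetical, and this is not the finite-state system described in this paper.

```python
import re

# Hypothetical patterns for two PHI categories (assumptions, not the authors' rules).
# DATE matches day.month.year written with dots, e.g. 12.03.2013.
DATE = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b")
# PHONE matches a simple national number such as 011 123 4567.
PHONE = re.compile(r"\b0\d{2}[ /-]?\d{3}[ -]?\d{3,4}\b")


def deidentify(text: str) -> str:
    """Replace dates with a year-only tag and phone numbers with a generic tag."""
    # Keep the year, since Safe Harbor removes date elements other than the year.
    text = DATE.sub(lambda m: f"[DATE {m.group(3)}]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


sample = "Admitted on 12.03.2013 with pulpitis. Contact: 011 123 4567."
print(deidentify(sample))
```

The medical term in the sample is left untouched while the identifiers are replaced by tags. Real clinical narratives would need far richer rules and lexical resources than these two patterns.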
The extraction of PHI can be viewed as a Named Entity Recognition (NER) problem applied to the medical domain for the purpose of de-identification (Nadeau, 2007). However, even though both traditional NER and de-identification involve the automatic recognition of particular phrases in text (persons, organizations, locations, dates, etc.), de-identification differs in important ways from traditional NER (Wellner et al., 2007). In contrast to general NER, which focuses on newspaper texts, de-identification deals with clinical narratives characterized by fragmented and incomplete utterances, a lack of punctuation marks and formatting, many spelling and grammatical errors, as well as domain-specific terminology and abbreviations. Since de-identification is the first step towards the identification and extraction of other relevant clinical information, it is extremely important to overcome the problem of the large number of eponyms and other non-PHI items erroneously categorized as PHI. For instance, anatomic locations, devices, diseases and procedures could be erroneously recognized as PHI and removed (e.g. “The Zvezdara method”¹ vs. Clinical Center “Zvezdara”), reducing the usability and the overall meaning of clinical notes, and thus the accuracy of subsequent automatic processes performed on the de-identified documents.

¹ The original surgical 2-step arteriovenous loop graft procedure developed in Clinical Center “Zvezdara”, Belgrade, Serbia. Zvezdara is a municipality of Belgrade.

In this paper we introduce our automatic clinical narrative text de-identification system, based on a rapid