Distillation of Knowledge from the Research Literature on Alzheimer's Dementia

Wutthipong Kongburan
King Mongkut's University of Technology Thonburi, Thailand
58130800102@st.sit.kmutt.ac.th

Mark Chignell
University of Toronto, Canada
chignell@mie.utoronto.ca

Jonathan Chan
King Mongkut's University of Technology Thonburi, Thailand
jonathan@sit.kmutt.ac.th

ABSTRACT
Many countries are aging societies. Since abilities generally deteriorate with age, technologies can assist older adults in their daily lives. Cognitive decline is particularly severe in cases of dementia, with around 70% (according to Alzheimers.net) of dementia cases involving Alzheimer’s Dementia (AD), a progressive and currently incurable disease. There is considerable research on AD, with thousands of relevant publications being added to the PubMed online database every year. The knowledge contained in this large body of work is spread across hundreds of thousands of pages of text, making it difficult to distill and mobilize that knowledge into treatments and guidelines. Text mining technology may assist in distilling knowledge from the vast corpus of research literature on Alzheimer’s dementia. In this paper, we apply Named Entity Recognition (NER), a text mining (TM) method that assigns words to predefined classes, to extract useful information from free text. We present findings on how well NER can extract information from a corpus of AD research publications.

CCS CONCEPTS
Applied computing → Life and medical sciences → Health care information systems

KEYWORDS
Aging society; Alzheimer intervention; Named entity recognition; PubMed; Quality of life

1. INTRODUCTION
An estimated 10% of the world population was aged 65 or older as of this writing, and in Japan and many European countries that proportion is over 20% and climbing.
In one example of this demographic trend, Statistics Canada reported in 2015 that, for the first time, there were more people aged 65 or over than under 15.¹ Meanwhile, in Japan, the proportion of citizens over the age of 65 reached 26% in 2015.² A growing number of older adults is associated with an increased burden of health problems, because many physical and cognitive functions decline even with healthy aging, and declines are typically more pronounced in the case of disease.

¹ http://www.statcan.gc.ca/daily-quotidien/150929/dq150929b-eng.pdf
² http://www8.cao.go.jp/kourei/english/annualreport/2014/pdf/c1-1.pdf

Alzheimer's disease (AD) is one of the most prevalent chronic medical conditions affecting older people and is a major cause of severe decline in cognition and loss of the ability to live independently. As of this writing there are close to 50 million cases of AD or related dementias worldwide.³ Between 50 and 70 percent of all dementia cases are AD, according to Alzheimers.net. In addition, one in nine Americans over 65 has AD.⁴ Behavioral symptoms associated with dementia include repetitive speech, wandering, and sleep disturbances, along with loss of memory and an increased risk of conditions such as depression and delirium. As of this writing there are no effective treatments for AD, and the clinical focus has been on managing the symptoms of dementia. Since many types of treatment have been proposed, information about which of them work for the behavioral problems seen at different stages of AD can enhance quality of life not only for people with AD but also for their caregivers.

The main aim of the research reported in this paper is to demonstrate how Text Mining (TM) can extract useful information about AD treatments from the scientific literature on AD. First, we describe the construction of a training dataset (corpus) from the abstracts of scientific papers with a focus on AD.
We then use Named Entity Recognition (NER), trained on this data set, to label entities of interest within a sample set of real-world test cases. The results demonstrate that NER can be used to classify relevant entities within the AD literature.

³ https://www.alz.org/documents_custom/2016-facts-and-figures.pdf
⁴ http://www.alzheimers.net/resources/alzheimers-statistics/

© 2017 International World Wide Web Conference Committee (IW3C2), published under Creative Commons CC BY 4.0 License. WWW 2017 Companion, April 3-7, 2017, Perth, Australia. ACM 978-1-4503-4914-7/17/04. DOI: http://dx.doi.org/10.1145/3041021.3054934

2. BACKGROUND
NER is a key approach to TM that identifies keywords in text streams and classifies them into predefined relevant categories such as gene or protein. Various techniques have been proposed to develop NER systems. They can be categorized as rule-based, dictionary-based, and Machine Learning (ML)-based (see [2] for more information). As shown in [2, 4, 5], when appropriate resources are available, ML-based solutions offer several advantages and perform better than dictionary-based and rule-based approaches. In this paper, we use ML-based TM to address the NER problem.

We use the NER classifier developed at Stanford University. Stanford NER is a Java implementation of a named entity recognizer that labels sequences of words in a text, such as the names of people, locations, and companies. Stanford NER uses the Conditional Random Fields (CRFs) technique [8] to train the classifier on a training set of labeled entities within a corpus of documents. Other projects that have used CRFs in NER include Gimli [1] and BANNER [9]. These two open source tools automatically tagging genes, proteins and other entity names in