Incorporating Unsupervised Features into CRF based Named Entity Recognition

Yuki Tawara
Nara Institute of Science and Technology
tawara.yuki.tn7@is.naist.jp

Mai Omura
Nara Institute of Science and Technology
omura.mai.oz5@is.naist.jp

Mirai Miura
Nara Institute of Science and Technology
miura.mirai.me1@is.naist.jp

ABSTRACT
We participated in the extraction of complaint and diagnosis task and the normalization of complaint and diagnosis task of MedNLP2 at NTCIR11. In the extraction task, we use a CRF-based Named Entity Recognition method. Moreover, we incorporate unsupervised features learned from a raw corpus into the CRF. We show that such unsupervised features improve system performance.

Team Name
CL

Subtasks
Task 1 (Extraction task)
Task 2 (Normalization task)

Keywords
Named Entity Recognition, Conditional Random Fields, Brown Clustering, Word Representation

1. INTRODUCTION
In the medical field, electronic media are increasingly used for information management. For example, clinical records have shifted to electronic form, and as a result there is strong demand for making use of them. Most of the information in clinical records is written in natural language, so utilizing electronic records requires Natural Language Processing (NLP) techniques. However, NLP techniques for the medical field are far from well developed.

We developed a system for the extraction of complaint and diagnosis task and the normalization of complaint and diagnosis task. The extraction task is to extract expressions that represent complaints and diagnoses, together with time expressions related to a patient, from the clinical records prepared for this competition. The normalization task is to assign ICD-10 class tags to the extracted complaints and diagnoses. In addition, we constructed a system that assigns modality tags to the extracted complaints and diagnoses. For details of the two tasks and ICD-10, see [1].

The rest of this paper is organized as follows.
Section 2 explains the method used in our system. Section 3 describes the experiments we conducted to evaluate our system and their results. Finally, Section 4 concludes this paper.

2. PROPOSED METHOD

2.1 Extraction of Complaint and Diagnosis
Extracting terms related to a particular domain is referred to as Named Entity Recognition (NER). Popular NER methods include rule-based methods and machine learning methods such as the Maximum Entropy Model and Conditional Random Fields (CRF) [4]. In this paper, we extract named entities with CRF, which is reported to achieve high performance [8].

2.1.1 Named Entity Recognition using CRF
NER can be viewed as assigning IOB tags to a sequence of morphemes, as in Figure 1, and in this way it is formalized as sequential labeling. The B tag indicates that the token is at the beginning of a named entity, the I tag indicates that it is inside a named entity, and the O tag indicates that it is outside any named entity.

Figure 1: An example of assignment of IOB tags to sequence of morphemes

CRF is a statistical model used for sequential labeling. It is a discriminative model, and its advantage is the flexibility with which features can be incorporated. In CRF, the label sequence is predicted so as to maximize the conditional probability of the label sequence y given the tokens x:

\[
\hat{y} = \operatorname*{arg\,max}_{y} \, p(y \mid x), \qquad
p(y \mid x) = \frac{1}{Z_x} \exp\left( \sum_{i=1}^{n} \sum_{k} \lambda_k f_k(y_{i-1}, y_i, x) \right)
\]

\[
Z_x = \sum_{y} \exp\left( \sum_{i=1}^{n} \sum_{k} \lambda_k f_k(y_{i-1}, y_i, x) \right)
\]

where Z_x is the normalizing constant and f_k is a feature function defined over tokens and labels. Through feature functions, we can incorporate various kinds of information into the model. λ_k is the weight of f_k, learned from an annotated corpus.

2.1.2 Unsupervised Features
In general, training a CRF requires annotated data. However, the amount of annotated data is limited, and preparing annotated data, especially in large quantities, requires substantial human effort. On the other hand, many documents that contain expressions of complaints and diagnoses are available, mainly on the Web.
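To make the IOB encoding and the CRF formulas of Section 2.1.1 concrete, the following is a minimal sketch rather than the system described in this paper. It assumes indicator feature functions of two kinds (label transition and token/label emission), and it computes Z_x and the arg max by enumerating every label sequence, which is only feasible for very short inputs; practical CRF toolkits use dynamic programming instead. The English tokens, the weights, and all function names are hypothetical and chosen only for illustration.

```python
import itertools
import math

LABELS = ["B", "I", "O"]


def spans_to_iob(n_tokens, spans):
    """Convert entity spans [(start, end), ...] (end exclusive) to IOB tags."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        tags[start] = "B"              # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I"              # remaining tokens of the entity
    return tags


def score(y, x, weights):
    """Compute sum_i sum_k lambda_k * f_k(y_{i-1}, y_i, x).

    Each f_k is an indicator feature (an assumption of this sketch):
    ("trans", prev_label, label) or ("emit", token, label).
    Features missing from `weights` have weight 0.
    """
    total = 0.0
    prev = "<s>"                       # hypothetical start-of-sequence label
    for token, label in zip(x, y):
        total += weights.get(("trans", prev, label), 0.0)
        total += weights.get(("emit", token, label), 0.0)
        prev = label
    return total


def all_label_sequences(n_tokens):
    """Enumerate every candidate label sequence (brute force)."""
    return itertools.product(LABELS, repeat=n_tokens)


def conditional_probability(y, x, weights):
    """p(y | x) = exp(score(y, x)) / Z_x, with Z_x summed over all y."""
    z_x = sum(math.exp(score(c, x, weights)) for c in all_label_sequences(len(x)))
    return math.exp(score(y, x, weights)) / z_x


def decode(x, weights):
    """Return the label sequence that maximizes p(y | x)."""
    best = max(all_label_sequences(len(x)), key=lambda c: score(c, x, weights))
    return list(best)


# Hypothetical example: one entity covering tokens 2..3 ("chest pain").
tokens = ["patient", "reports", "chest", "pain", "today"]
gold = spans_to_iob(len(tokens), [(2, 4)])

# Hand-set weights standing in for the lambda_k learned from annotated data.
weights = {
    ("emit", "patient", "O"): 1.0,
    ("emit", "reports", "O"): 1.0,
    ("emit", "chest", "B"): 2.0,
    ("emit", "pain", "I"): 2.0,
    ("emit", "today", "O"): 1.0,
    ("trans", "B", "I"): 1.0,
    ("trans", "O", "I"): -5.0,         # discourage I directly after O
}

if __name__ == "__main__":
    print(gold)
    print(decode(tokens, weights))
    print(conditional_probability(gold, tokens, weights))
```

Because every feature enters the score only through a weighted sum, additional information such as the unsupervised features discussed in Section 2.1.2 could be added as further entries in `weights` without changing the scoring or decoding code.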
Proceedings of the 11th NTCIR Conference, December 9-12, 2014, Tokyo, Japan

Therefore, it is worth