Simultaneous Character-Cluster-Based Word Segmentation and Named Entity Recognition in Thai Language

Nattapong Tongtep and Thanaruk Theeramunkong

School of Information, Computer, and Communication Technology,
Sirindhorn International Institute of Technology, Thammasat University
131 Moo 5, Tiwanont Rd., Bangkadi, Muang, Pathum Thani, Thailand 12000
{nattapong,thanaruk}@siit.tu.ac.th
http://www.siit.tu.ac.th

Abstract. Named entity recognition in inherent-vowel alphabetic languages such as Burmese, Khmer, Lao, Tamil, Telugu, Balinese, and Thai is difficult since there are no explicit boundaries between words or sentences. This paper presents a novel method that exploits the concept of character clusters, i.e., sequences of inseparable characters. The method groups characters into clusters and uses statistics among characters and their clusters to extract Thai words and recognize named entities simultaneously. It integrates two phases, a word-segmentation model and a named-entity-recognition model. Context features are exploited to learn the parameters of these two discriminative probabilistic models, both conditional random fields (CRFs), which rank the generated word and named entity candidates. The experimental results show that our method significantly improves the performance of word segmentation and named entity recognition, with F-measures of 96.14% and 83.68%, respectively.

Keywords: Named Entity Recognition, Word Segmentation, Character Cluster, Information Extraction.

1 Introduction

Nowadays, a growing amount of textual information is available in various formats, especially digital ones. To utilize such information, Information Extraction (IE) plays an important role in classifying, categorizing, or transforming relevant information from unstructured text documents into a proper format. IE typically involves four steps: named entity extraction, relation extraction [12], co-reference resolution, and slot filling.
Extracting named entities, usually called Named Entity Recognition and Classification (NERC) [7], is recognized as one of the most important tasks of IE. Normally, the recognition step determines the boundaries of entity occurrences, while the classification step assigns a type, such as person name, date, time, organization, or money expression, to each segmented entity.

T. Theeramunkong et al. (Eds.): KICSS 2010, LNAI 6746, pp. 216–225, 2011.
© Springer-Verlag Berlin Heidelberg 2011