Applying a Finite Automata Acquisition Algorithm to Named Entity Recognition

Muntsa Padró and Lluís Padró

TALP Research Center
Universitat Politècnica de Catalunya
{mpadro, padro}@lsi.upc.edu

Abstract. In this work, the Causal-State Splitting Reconstruction algorithm, originally conceived to model stationary processes by learning finite state automata from data sequences, is applied for the first time to an NLP task, namely Named Entity Recognition. The results obtained are slightly below those of the best systems presented in the CoNLL 2002 shared task, but given the simplicity of the features used, they are promising. Having established the viability of this algorithm for NLP tasks, we plan to improve the results obtained on the NER task and to apply it to other NLP sequence recognition tasks such as PoS tagging, chunking, and the acquisition of subcategorization patterns.

1 Introduction

Some Natural Language Processing (NLP) tasks may be naturally approached using finite state automata and machine learning algorithms. These automata can be built by hand using linguistic knowledge, or they can be statistical models such as Hidden Markov Models (HMM). For statistical automata, the structure usually has to be defined beforehand. For an HMM, for example, it is necessary to define what the states represent, and statistics are used only to learn the transition and emission probabilities. Nevertheless, there are algorithms that learn automata from data [1, 2, 3, 4, 5, 6]. One algorithm of this kind is CSSR (Causal-State Splitting Reconstruction), which infers the causal states of a process from sequential data.

In this work, a first approach to applying this algorithm to NLP tasks is presented. The task chosen as a starting point is Named Entity Recognition (NER). The results presented in this paper are preliminary, since the experiments performed take only a few features into account.
Nevertheless, the results obtained are quite promising: they are not far from those of state-of-the-art systems, and there is still a large margin for improvement over these preliminary experiments. The current results suggest that this algorithm can be reliably applied to NER, and we expect to obtain good results in the future when applying it to other NLP tasks.

A. Yli-Jyrä, L. Karttunen, and J. Karhumäki (Eds.): FSMNLP 2005, LNAI 4002, pp. 203–214, 2006. © Springer-Verlag Berlin Heidelberg 2006

The Named Entity Recognition (NER) task consists of detecting names referring to entities such as persons, locations, and organizations in a text. This