EXPLOITING SVM FOR ITALIAN NAMED ENTITY RECOGNITION

ABSTRACT

We present EntityPro, a system for Named Entity Recognition (NER) based on Support Vector Machines. EntityPro was trained with a large number of both static and dynamic features. The system performed best on the Italian NER task at EVALITA 2007, with an F1 measure of 82.14%.

Keywords: Named Entity Recognition, SVM

1. Introduction

Named Entity Recognition (NER) is a subtask of Information Extraction that aims to locate words in text and classify them into predefined categories such as the names of persons, organizations, and locations, time expressions, etc. The techniques most frequently applied to this task are based on machine learning: Hidden Markov Models, Maximum Entropy Models, and Support Vector Machines (SVMs). SVMs were introduced to Text Categorization by T. Joachims [2] and subsequently used for many other NLP tasks, as they scale well to high-dimensional feature spaces. SVMs are now among the most popular machine learning techniques, and a number of implementations and development environments are available for them, such as YamCha [3], an open source text chunker that can be easily adapted to other NLP tasks. YamCha handles both static and dynamic features, and allows a number of parameters to be set, such as the window size, the parsing direction (forward/backward), and the strategy for multi-class problems (pairwise/one vs. rest).
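The distinction between static and dynamic features can be illustrated with a minimal Python sketch. This is not YamCha's code or API: the function names, the feature-key format, the POS labels, and the rule-based toy classifier (which merely stands in for a trained SVM) are all assumptions made for illustration. Static features are the columns of the tokens inside the window; dynamic features are the tags already assigned to the preceding tokens during forward parsing.

```python
from typing import Callable, Dict, List, Sequence


def window_features(
    rows: Sequence[Sequence[str]],
    i: int,
    predicted: Sequence[str],
    window: int = 2,
) -> Dict[str, str]:
    """Build the feature dictionary for token i (hypothetical key format).

    rows:      one row per token, each row holding its static columns
               (e.g. word form, Part of Speech).
    predicted: tags already assigned to tokens 0..i-1 (forward parsing).
    """
    feats: Dict[str, str] = {}
    # Static features: every column of every token inside the window.
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(rows):
            for col, value in enumerate(rows[j]):
                feats[f"F:{offset}:{col}"] = value
    # Dynamic features: tags decided for the preceding tokens only.
    for offset in range(-window, 0):
        j = i + offset
        if 0 <= j < len(predicted):
            feats[f"T:{offset}"] = predicted[j]
    return feats


def tag_forward(
    rows: Sequence[Sequence[str]],
    classify: Callable[[Dict[str, str]], str],
    window: int = 2,
) -> List[str]:
    """Tag left to right, feeding earlier decisions back in as features."""
    predicted: List[str] = []
    for i in range(len(rows)):
        predicted.append(classify(window_features(rows, i, predicted, window)))
    return predicted


def toy_classify(feats: Dict[str, str]) -> str:
    """Toy stand-in for a trained SVM (assumption, not the real model).

    A capitalized word continues a person name if the previous tag was a
    person tag; otherwise it starts one. Everything else is outside.
    """
    word = feats["F:0:0"]
    previous_tag = feats.get("T:-1", "O")
    if word[:1].isupper():
        return "I-PER" if previous_tag in ("B-PER", "I-PER") else "B-PER"
    return "O"


if __name__ == "__main__":
    # Columns: word form, illustrative POS label (labels are made up).
    sentence = [["Mario", "NOUN"], ["Rossi", "NOUN"], ["parla", "VERB"]]
    print(tag_forward(sentence, toy_classify, window=1))
```

The toy classifier only reads the current word and the previous tag, but it shows why dynamic features matter: the decision for "Rossi" depends on the tag already assigned to "Mario". Backward parsing would run the same loop from right to left, using the tags of the following tokens instead.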
We used YamCha to build EntityPro, a system for the recognition of Italian Named Entities that exploits a rich set of linguistic features, such as the Part of Speech and the occurrence of a word in proper noun gazetteers. EntityPro is part of TextPro, a suite of modular NLP tools developed at FBK-irst. The EntityPro tagger has recently been trained on the EVALITA development set, in which named entities are represented in the IOB2 format. The data contain entities of four types: Geo-Political Entity (GPE), Location (LOC), Organization (ORG) and Person (PER). We assume that named entities are non-recursive and non-overlapping. In the rest of this paper we provide further details on the feature space that we used and on the results we obtained.

Figure 1: EntityPro’s architecture

2. EVALITA NER task

Both the development and test data are part of the Named Entity task of EVALITA 2007. The use of other external resources is allowed. EntityPro was configured by randomly splitting the development set into two parts: a data set for training (92,241 tokens) and a data set for tuning the system (40,348 tokens). The resulting best configuration was then evaluated on the test set.

For each running word, a rich set of 18 features is extracted: the word itself, both unchanged and lowercased; its Part of Speech, as produced by TagPro [1]; prefixes and suffixes (1, 2, 3, or 4 characters at the start/end of the word); orthographic information (e.g.

EMANUELE PIANTA · ROBERTO ZANOLI · CONTRIBUTI SCIENTIFICI · p. 69 · Year IV, No. 2, June 2007