Named Entity Recognition in Albanian Based on CRFs Approach

Gridi Kono
Department of Informatics, Faculty of Natural Sciences
University of Tirana, 1001 Tirana, Albania
gridi.kono@gmail.com

Klesti Hoxha
Department of Informatics, Faculty of Natural Sciences
University of Tirana, 1001 Tirana, Albania
klesti.hoxha@fshn.edu.al

Abstract

Named Entity Recognition (NER) refers to the process of extracting named entities (people, locations, organizations, sport teams, etc.) from text documents. In this work we describe our NER approach for documents written in Albanian. We explore the use of Conditional Random Fields (CRFs) for this purpose. Adequate annotated training corpora are not yet publicly available for Albanian, so we have created our own manually annotated corpus. The corpus is drawn from Albanian news documents published in 2015 and 2016. We have tested our trained model with two test sets. Overall precision, recall and F-score are 83.2%, 60.1% and 69.7% respectively.

1 Introduction

Named Entity Recognition (NER) is an important tool in almost all Natural Language Processing (NLP) application areas. NLP systems that include some form of information extraction have gained much attention from both the academic and the business intelligence communities.

Named entity recognition is defined as the process of identifying words in a text and classifying them into different classes [ZPZ04]. In simple terms, a named entity is a group of consecutive words in a sentence that represents a real-world entity such as a person, location, organization, or date. For instance, in the sentence "Matteo Renzi is an Italian politician who has been the Prime Minister of Italy since 22 February 2014 and Secretary of the Democratic Party since 15 December 2013.", "Matteo Renzi", "Italy" and "Democratic Party" can be classified as person, location and organization entities, respectively.
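To make the notion of "a group of consecutive words" concrete, the following minimal Python sketch turns entity spans into one label per token for a shortened version of the example sentence. The BIO labeling scheme (B- for the first token of an entity, I- for the following ones, O for everything else), the class names and the helper `bio_tags` are assumptions chosen for illustration; they are not the annotation format of the corpus described in this paper.

```python
# Illustrative sketch only: the BIO scheme, the class names and the helper
# below are assumptions, not the annotation format used in this paper.

def bio_tags(tokens, entities):
    """Return one BIO tag per token.

    `entities` maps a tuple of consecutive tokens to a class name.
    """
    tags = ["O"] * len(tokens)
    for phrase, label in entities.items():
        n = len(phrase)
        # Slide a window of the phrase length over the sentence.
        for start in range(len(tokens) - n + 1):
            if tuple(tokens[start:start + n]) == phrase:
                tags[start] = "B-" + label
                for k in range(start + 1, start + n):
                    tags[k] = "I-" + label
    return tags


# Shortened version of the example sentence, split on whitespace.
tokens = ("Matteo Renzi is the Prime Minister of Italy "
          "and Secretary of the Democratic Party").split()

entities = {
    ("Matteo", "Renzi"): "PERSON",
    ("Italy",): "LOCATION",
    ("Democratic", "Party"): "ORGANIZATION",
}

tags = bio_tags(tokens, entities)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Viewed this way, NER becomes a sequence labeling problem: each token receives exactly one label, and a multi-word entity is recovered by joining a B- token with the I- tokens that follow it.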
In this work we describe a machine learning approach for recognizing named entities in Albanian text documents. The Albanian language lacks publicly available annotated training corpora for NER. We have therefore created a custom annotated corpus consisting of news articles written in Albanian and published in various online news media. The corpus was created using a custom-built web application that allowed for n-gram based annotation sessions. Experiments were conducted using the Stanford CRF-based NER toolkit (http://nlp.stanford.edu/software/CRF-NER.html). Results were promising despite the small size of the created corpus.

The rest of this paper is structured as follows. Section 2 presents previous work on NER and related approaches. Section 3 describes the Conditional Random Fields approach. Section 4 describes our corpus and the methodology used for creating it. Section 5 presents the experiments and their results. Finally, Section 6 concludes the paper.

2 Related Works

NER approaches have been reported since the early 1990s. One of the first was described by Rau [Rau91]: a system that extracts and recognizes company names, relying on handcrafted rules and heuristics.

Since NER is language dependent, many systems have been presented for different languages. [DBG+00] describes a NER system that recognizes named entities in texts written in Greek. This