Automatic Information Extraction in Semi-Structured Official Journals

Valmir Macário Filho, Ricardo B. C. Prudêncio, Francisco A. T. De Carvalho
Center of Informatics, Federal University of Pernambuco
Av. Prof. Luiz Freire, s/n 50740-540 Recife/PE BRAZIL
{vmf2,rbcp,fatc}@cin.ufpe.br

Leandro R. Torres, Laerte Rodrigues Júnior
Capital Login
R. da Guia, N 99 50.030-210 Recife/PE BRAZIL
{leandror,laerter}@capitallogin.com.br

Marcos G. Lima
Department of Information Science, Federal University of Pernambuco
Av. dos Reitores, s/n - CEP 50670-901 - Recife/PE BRAZIL
galyndo@gmail.com

Abstract

Information extraction systems are used to extract only the relevant textual information from digital repositories. The current work proposes an automatic system to extract information from semi-structured official journals. In our approach, given an input document, a Machine Learning (ML) algorithm classifies the document's fragments into class labels which correspond to the data fields to be extracted. The implemented system uses different feature sets and algorithms in the classification of the fragments. The system was evaluated through experiments on a sample containing 22770 lines of Pernambuco's Official Journal. The experiments revealed, in general, good results in terms of precision, which ranged from 70.14% to 98.63% depending on the feature set and algorithm used in the classification of the fragments.

1 Introduction

A great amount of valuable information is stored in digital repositories of textual documents [1]. A significant part of the information contained in these repositories is legible only by humans and can hardly be processed by computers. Hence, it is appropriate to develop systems which are capable of automatically extracting information from these repositories in order to support specific users' needs [2].
Examples of such needs include searching for information in historical documents, finding specific sections in a magazine, and extracting publications from an official journal.

Official journals are documents that contain publications (e.g., acts, texts of new laws, edicts, decisions) of countries, states, cities and other institutions in the Executive, Legislative and Judiciary branches of power. Nowadays, these documents are becoming increasingly available on web sites as a new form of information service (e.g., the Official Journal of the European Union, http://eur-lex.europa.eu, and the Official Journal of the Federative Republic of Brazil, http://portal.in.gov.br/imprensa).

The task of finding specific information of interest in official journals is very difficult due to the large number of publications made available every day. Although this task can be automated, several difficulties stand in the way: the lack of rigid models to organize the publications in the documents, the absence of clear delimiters between different publications, the presence of abbreviated words, and the presence of orthographic errors, among others.

Documents which present the above-cited characteristics are called semi-structured texts [1]. In order to manipulate such documents, an automatic system known as an Information Extraction (IE) system may be very suitable. IE systems are able to extract specific information of interest from a repository of textual documents. Each input of an IE system is a textual document and the output is a set of text fragments which correspond to the data fields required by the user. The extracted fields can be either directly presented to the user