Automatic semantic relation extraction from Portuguese texts

Leonardo Sameshima Taba, Helena de Medeiros Caseli
Federal University of São Carlos
Rod. Washington Luís Km. 235
leonardo taba@dc.ufscar.br, helenacaseli@dc.ufscar.br

Abstract

Nowadays we are facing a growing demand for semantic knowledge in computational applications, particularly in Natural Language Processing (NLP). However, there are not enough human resources to produce that knowledge at the rate at which it is demanded. For Portuguese, which has few semantic resources, the situation is even more alarming. To help address that problem, this work investigates how some semantic relations can be automatically extracted from Portuguese texts. The two main approaches investigated here are based on (i) textual patterns and (ii) machine learning algorithms. Specifically, this work examines how and to what extent these two approaches can be applied to the automatic extraction of seven binary semantic relations (is-a, part-of, location-of, effect-of, property-of, made-of and used-for) in Portuguese texts. The results indicate that machine learning, in particular Support Vector Machines, is a promising technique for the task, although textual patterns presented better results for the used-for relation.

Keywords: Semantic relation extraction, Information extraction, Text mining

1. Introduction

The usage and importance of semantic information in Natural Language Processing (NLP) tasks is growing rapidly. However, the rate at which humans can produce and analyze semantic information is much lower than the rate at which NLP applications need it. As one of the efforts that will hopefully help bridge that gap, this paper presents an automatic semantic relation extraction method using lexical-syntactic data. Semantic relation extraction is the task of finding semantic relations between terms in texts.
There is no single formal definition of “semantic relation” or “term”. Therefore, in this paper, “semantic relation” stands for any relation, explicit or implicit, between terms on a semantic level. A “term” is a contiguous sequence of tokens, which in turn are defined as any sequence of characters separated by spaces.

This work focuses on the Portuguese language, which still lacks high-quality linguistic resources and tools, especially at the semantic level. Seven semantic relations are targeted: hyponymy (is-a), meronymy (part-of), locality (location-of), causality (effect-of), property-of (something has a certain property), made-of (something is made of some material) and used-for (something is used for a certain end). These relations are a subset of the ones used in the Open Mind Common Sense (OMCS) project¹ and were chosen to meet the needs of the Brazilian branch of the OMCS project².

In order to extract these seven relations automatically, a textual pattern strategy and two supervised machine learning algorithms, C4.5 decision trees (Quinlan, 1993) and Support Vector Machines (Vapnik, 1995), were evaluated.

¹ http://openmind.media.mit.edu/
² http://www.sensocomum.ufscar.br/

2. Related work

There has been extensive work on the subject of semantic relation identification, mostly for the English language. The first researched approach was the textual patterns paradigm, pioneered by Hearst (1992). In her paper, Hearst describes six textual patterns that indicate the presence of a hyponymy relation between two noun phrases. She also proposed an algorithm to find patterns that imply a semantic relation R. Hearst applied her patterns to encyclopedic and journalistic corpora and found that 63% of the identified relations were of good quality.

Berland and Charniak (1999) follow Hearst’s algorithm, but search for meronymy relations.
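As a rough illustration of the textual patterns paradigm, the sketch below applies a single Hearst-style pattern ("X such as Y1, Y2 and Y3") with a regular expression. It is a toy under strong assumptions: noun phrases are approximated as single words, the example sentence is in English, and the name `extract_is_a` is hypothetical. It is not the implementation used by Hearst or by any of the works cited here.

```python
import re

# Separator between listed items: ", and ", " and ", ", or ", " or ", or ", ".
SEP = r"(?:,? (?:and|or) |, )"

# Toy version of one Hearst-style pattern: "X such as Y1, Y2 and Y3".
# Assumption: a noun phrase is a single alphabetic word; a real system
# would delimit noun phrases with a chunker or parser.
SUCH_AS = re.compile(
    r"(?P<hyper>[A-Za-z]+),? such as (?P<hypos>[A-Za-z]+(?:" + SEP + r"[A-Za-z]+)*)"
)


def extract_is_a(sentence):
    """Return (hyponym, hypernym) pairs suggested by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group("hyper")
        # Split the captured list back into its individual items.
        hyponyms = re.split(SEP, match.group("hypos"))
        pairs.extend((hyponym, hypernym) for hyponym in hyponyms)
    return pairs


if __name__ == "__main__":
    print(extract_is_a("He plays instruments such as guitar, piano and drums."))
```

For the example sentence, the function pairs each of "guitar", "piano" and "drums" with "instruments". Because the pattern looks only at surface strings, it can also capture words that merely follow a comma after the list, which is one reason pattern output is usually evaluated manually.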
Berland and Charniak’s results, obtained by applying the patterns to a 100-million-word journalistic corpus, show that, on average, 55% of the relations found were correct. Girju and Moldovan (2002) also follow Hearst’s algorithm, looking for causality relations in a journalistic corpus and reporting 65% accuracy.

Freitas and Quental’s (2007) work is one of the few that focuses on the Portuguese language. They adapted Hearst’s patterns to Portuguese, creating 4 patterns that indicate hyponymy, and applied them to a corpus of around 2 million words from the public health domain. The results are compatible with Hearst’s, showing that 73% of the relations found were of high quality.

Noticing the shortcomings of the textual patterns approach (namely, high precision but low recall) and encouraged by the increasing abundance of available textual data, researchers turned to machine learning (ML) techniques, which leverage large quantities of text in order to find semantic relations.

The work of Girju et al. (2003) uses C4.5 decision trees (Quinlan, 1993) to extract part-whole relations from journalistic corpora. Using the same idea as Hearst’s algorithm, some of the meronym pairs from WordNet were searched for in these corpora, and some patterns that may indicate the part-whole relation were derived, such as “NP (noun phrase) of PP (prepositional phrase)”, “NP’s PP” and “NP verb PP”. However, these patterns are very ambiguous, as they can indicate relations other than meronymy. In order to solve that problem, Girju et al. (2003) propose learning semantic restrictions over the participants in the relation. The researchers reported 83% precision and 72% recall, re-